CS 267: Applications of Parallel Computers
Structured Grids
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_sp07
02/28/2007
CS267 Lecture 13
Outline
° Review PDEs
° Overview of Methods for Poisson Equation
° Jacobi’s method
° Red-Black SOR method
° Conjugate Gradient
° Optimizing Stencils for Memory Hierarchies
° Parallel Stencil Computations
° FFT (future lecture)
° Multigrid (future lecture)
Solving PDEs
° Hyperbolic problems (waves):
• Sound wave(position, time)
• Use explicit time-stepping
• Solution at each point depends on neighbors at previous time
° Elliptic (steady state) problems:
• Electrostatic Potential (position)
• Everything depends on everything else
• This means locality is harder to find than in hyperbolic problems
° Parabolic (time-dependent) problems:
• Temperature(position,time)
• Involves an elliptic solve at each time-step
° Focus on elliptic problems
• Canonical example is the Poisson equation
2u/x2 + 2u/y2 + 2u/z2 = f(x,y,z)
Explicit Solution of the Heat Equation
° Explicit methods often used for hyperbolic PDEs
° Computation corresponds to
• Matrix vector multiply
• Combine nearest neighbors on grid
° Use finite differences with u[j,i] as the temperature at
  • time t = i*δ (i = 0,1,2,…) and
  • position x = j*h (j = 0,1,…,N, with N = 1/h)
  • initial conditions on u[j,0]
  • boundary conditions on u[0,i] and u[N,i]
° At each timestep i = 0,1,2,...
     For j = 1 to N-1
        u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
     where z = C*δ/h²
[Figure: space-time grid for the explicit method; the time index i increases upward (rows i = 0 through i = 5) and the space index j runs across the bottom row u[0,0], u[1,0], …, u[5,0]]
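A minimal C sketch of this explicit update (the ping-pong buffering and function names are illustrative, not from the slides; recall that stability of the explicit scheme requires z = C*δ/h² <= 1/2):

    /* One explicit timestep: u[j,i+1] = z*u[j-1,i] + (1-2z)*u[j,i] + z*u[j+1,i] */
    void heat_step(const double *u_old, double *u_new, int N, double z) {
        u_new[0] = u_old[0];            /* boundary values are held fixed */
        u_new[N] = u_old[N];
        for (int j = 1; j < N; j++)
            u_new[j] = z * u_old[j-1] + (1.0 - 2.0*z) * u_old[j] + z * u_old[j+1];
    }

    /* Time loop: keep two copies of u and swap (ping-pong) each step. */
    double *heat_solve(double *u, double *tmp, int N, double z, int nsteps) {
        for (int i = 0; i < nsteps; i++) {
            heat_step(u, tmp, N, z);
            double *t = u; u = tmp; tmp = t;
        }
        return u;   /* points to the buffer holding the latest values */
    }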
Matrix View of Explicit Method for Heat
• u[j,i+1]= z*u[j-1,i]+ (1-2*z)*u[j,i] + z*u[j+1,i]
• u[ :, i+1] = T * u[ :, i] where T is tridiagonal:
      [ 1-2z    z                     ]
      [   z   1-2z    z               ]
T  =  [         z   1-2z    z         ]   =   I – z*L
      [               z   1-2z    z   ]
      [                     z   1-2z  ]

• L is called the Laplacian (in 1D); it is tridiagonal with 2 on the diagonal and -1 on the off-diagonals:

      [  2   -1                 ]
      [ -1    2   -1            ]
L  =  [      -1    2   -1       ]
      [           -1    2   -1  ]
      [                -1    2  ]

  Graph and “3 point stencil”: [Figure: chain of grid points, each connected to its two neighbors, with stencil weights (-1, 2, -1)]
• For a 2D mesh (5 point stencil) the Laplacian is pentadiagonal (5 nonzeros per row, along diagonals)
• For a 3D mesh there are 7 nonzeros per row
Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)
Algorithm         Serial           PRAM                    Memory            #Procs
Dense LU          N^3              N                       N^2               N^2
Band LU           N^2 (N^7/3)      N                       N^3/2 (N^5/3)     N (N^4/3)
Jacobi            N^2 (N^5/3)      N (N^2/3)               N                 N
Explicit Inv.     N^2              log N                   N^2               N^2
Conj. Gradients   N^3/2 (N^4/3)    N^1/2 (N^1/3) * log N   N                 N
Red/Black SOR     N^3/2 (N^4/3)    N^1/2 (N^1/3)           N                 N
Sparse LU         N^3/2 (N^2)      N^1/2                   N*log N (N^4/3)   N
FFT               N*log N          log N                   N                 N
Multigrid         N                log^2 N                 N                 N
Lower bound       N                log N                   N
PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
Overview of Algorithms on n x n grid (n x n x n grid) (1)
° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines.
• from most general (works on any matrix) to most specialized (works on matrices
“like” T).
° Dense LU: Gaussian elimination; works on any N-by-N matrix.
° Band LU: Exploits the fact that T is nonzero only on the N^1/2 (N^2/3) diagonals
nearest the main diagonal.
° Jacobi: Essentially does matrix-vector multiply by T in inner loop of iterative
algorithm.
° Explicit Inverse: Assume we want to solve many systems with T, so we can
precompute and store inv(T) “for free”, and just multiply by it (but still
expensive!).
° Conjugate Gradient: Uses matrix-vector multiplication, like Jacobi, but exploits
mathematical properties of T that Jacobi does not.
° Red-Black SOR (successive over-relaxation): Variation of Jacobi that exploits
yet different mathematical properties of T. Used in multigrid schemes.
° Sparse LU: Gaussian elimination exploiting particular zero structure of T.
° FFT (fast Fourier transform): Works only on matrices very like T.
° Multigrid: Also works on matrices like T, that come from elliptic PDEs.
° Lower Bound: Serial (time to print answer); parallel (time to combine N inputs).
° Details in class notes and www.cs.berkeley.edu/~demmel/ma221.
Overview of Algorithms on n x n grid (n x n x n grid) (2)
° Dimension = N = n^2 in 2D (= n^3 in 3D)
° Bandwidth = BW = n in 2D (= n^2 in 3D)
° Condition number = k = n^2 in 2D or 3D
° #its = number of iterations
° SpMV = Sparse-matrix-vector-multiply
° Dense LU:
  • cost = N^3 (serial) or N (parallel, with N^2 procs)
° Band LU:
  • cost = N * BW^2 (serial) or N (parallel, with BW^2 procs)
° Jacobi:
  • cost = cost(SpMV) * #its, #its = k
    (the SpMV cost is the cost of a nearest-neighbor relaxation, which is O(N))
° Conjugate Gradient:
  • cost = (cost(SpMV) + cost(dot prod)) * #its, #its = sqrt(k)
° Red-Black SOR:
  • cost = cost(SpMV) * #its, #its = sqrt(k)
° Iterative algorithms need different assumptions for convergence analysis
Jacobi’s Method
° To derive Jacobi’s method, write Poisson as:
u(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j))/4
° Let u(i,j,m) be approximation for u(i,j) after m steps
u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) + u(i,j-1,m) +
u(i,j+1,m) + b(i,j)) / 4
° I.e., u(i,j,m+1) is a weighted average of neighbors
° Motivation: u(i,j,m+1) chosen to exactly satisfy
equation at (i,j)
° Steps to converge proportional to problem size, N = n^2
  • See Lecture 24 of www.cs.berkeley.edu/~demmel/cs267_Spr99
° Therefore, serial complexity is O(N^2)
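A minimal C sketch of one Jacobi sweep on the n x n interior of an (n+2) x (n+2) array with a boundary layer (the row-major IDX macro and the array names are illustrative):

    #define IDX(i, j, n) ((i) * ((n) + 2) + (j))

    /* One Jacobi sweep:
       u_new(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) + b(i,j)) / 4
       for the interior points; the boundary layer is left untouched.        */
    void jacobi_sweep(const double *u, double *u_new, const double *b, int n) {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                u_new[IDX(i, j, n)] = 0.25 * (u[IDX(i-1, j, n)] + u[IDX(i+1, j, n)] +
                                              u[IDX(i, j-1, n)] + u[IDX(i, j+1, n)] +
                                              b[IDX(i, j, n)]);
    }

As on this slide, the sweep is repeated (swapping u and u_new) until the change is small, which takes on the order of N = n^2 sweeps.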
Convergence of Nearest Neighbor Methods
° Jacobi’s method involves nearest neighbor computation on an n x n grid (N = n^2)
  • So it takes O(n) = O(sqrt(N)) iterations for information to propagate
° E.g., consider a right-hand side (b) that is 0, except the center is 1
° The exact solution looks like: [Figure: the solution, peaked at the center and decaying toward the boundary]
° Even in the best case, any nearest neighbor computation will take n/2 steps to propagate across an n x n grid
Improvements on Jacobi
° Similar to Jacobi: u(i,j,m+1) is computed as a linear
combination of neighbors
° Numeric coefficients and update order are different
° Based on 2 improvements over Jacobi
• Use “most recent values” of u that are available, since these are
probably more accurate
• Update value of u(m+1) “more aggressively” at each step
° First, note that while evaluating sequentially
• u(i,j,m+1) = (u(i-1,j,m) + u(i+1,j,m) …
some of the values for m+1 are already available
• u(i,j,m+1) = (u(i-1,j,latest) + u(i+1,j,latest) …
where latest is either m or m+1
Gauss-Seidel
° Updating left-to-right row-wise order, we get the
Gauss-Seidel algorithm
for i = 1 to n
for j = 1 to n
u(i,j,m+1) = (u(i-1,j,m+1) + u(i+1,j,m) + u(i,j-1,m+1) + u(i,j+1,m)
+ b(i,j)) / 4
° Cannot be parallelized, because of dependencies, so
instead we use a “red-black” order
forall black points u(i,j)
u(i,j,m+1) = (u(i-1,j,m) + …
forall red points u(i,j)
u(i,j,m+1) = (u(i-1,j,m+1) + …
° For general graph, use graph coloring
° Can use repeated Maximal Independent Sets to color
° Graph(T) is bipartite => 2 colorable (red and black)
° Nodes for each color can be updated simultaneously
° Still sparse-matrix-vector multiply, using submatrices
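A minimal C sketch of one red-black Gauss-Seidel sweep, updating in place (coloring by the parity of i+j; which color is called red is a convention; indexing as in the Jacobi sketch above):

    #define IDX(i, j, n) ((i) * ((n) + 2) + (j))

    /* One red-black Gauss-Seidel sweep on the n x n interior.
       color = 0 updates points with i+j even ("red"),
       color = 1 updates points with i+j odd ("black").          */
    void rb_gs_sweep(double *u, const double *b, int n) {
        for (int color = 0; color <= 1; color++)
            for (int i = 1; i <= n; i++)
                for (int j = 1 + ((i + color + 1) & 1); j <= n; j += 2)
                    u[IDX(i, j, n)] = 0.25 * (u[IDX(i-1, j, n)] + u[IDX(i+1, j, n)] +
                                              u[IDX(i, j-1, n)] + u[IDX(i, j+1, n)] +
                                              b[IDX(i, j, n)]);
    }

Because each color only reads values of the other color, all points of one color can be updated simultaneously and in place, which is what makes the red-black ordering parallelizable.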
Successive Overrelaxation (SOR)
° Red-black Gauss-Seidel converges twice as fast as Jacobi, but
there are twice as many parallel steps, so the same in practice
° To motivate next improvement, write basic step in algorithm as:
u(i,j,m+1) = u(i,j,m) + correction(i,j,m)
° If “correction” is a good direction to move, then one should move
even further in that direction by some factor w>1
u(i,j,m+1) = u(i,j,m) + w * correction(i,j,m)
° Called successive overrelaxation (SOR)
° Parallelizes like Jacobi (Still sparse-matrix-vector multiply…)
° Can prove w = 2/(1 + sin(π/(n+1))) for best convergence
  • Number of steps to converge = parallel complexity = O(n), instead of O(n^2) for Jacobi
  • Serial complexity O(n^3) = O(N^3/2), instead of O(n^4) = O(N^2) for Jacobi
Conjugate Gradient (CG) for solving A*x = b
° This method can be used when the matrix A is
• symmetric, i.e., A = A^T
• positive definite, defined equivalently as:
   - all eigenvalues are positive
   - x^T * A * x > 0 for all nonzero vectors x
   - a Cholesky factorization A = L*L^T exists
° Algorithm maintains 3 vectors
• x = the approximate solution, improved after each iteration
• r = the residual, r = b - A*x
• p = search direction, also called the conjugate gradient
° One iteration costs
• Sparse-matrix-vector multiply by A (major cost)
• 3 dot products, 3 saxpys (scale*vector + vector)
° Converges in O(n) = O(N^1/2) steps, like SOR
  • Serial complexity = O(N^3/2)
  • Parallel complexity = O(N^1/2 log N), with the log N factor from dot-products
Conjugate Gradient Algorithm for Solving Ax=b
° Initial guess x
° r = b – A*x, j=1
° Repeat
  • rho = r^T * r                                                      … dot product
  • If j = 1, p = r, else beta = rho/old_rho, p = r + beta*p, endif    … saxpy
  • q = A*p                                                            … sparse matrix vector multiply
  • alpha = rho / (p^T * q)                                            … dot product
  • x = x + alpha * p                                                  … saxpy
  • r = r – alpha * q                                                  … saxpy
  • old_rho = rho;  j = j+1
° Until rho small enough
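A minimal C sketch of this loop for the 2D Poisson problem, with the sparse matrix-vector multiply applied matrix-free as the 5-point stencil (all arrays are assumed zero-initialized, e.g. via calloc, with b equal to 0 on the boundary layer; names and the convergence tolerance are illustrative):

    #include <math.h>
    #define IDX(i, j, n) ((i) * ((n) + 2) + (j))

    /* q = T*p for the 2D Poisson matrix (4 on the diagonal, -1 per neighbor). */
    static void poisson_matvec(const double *p, double *q, int n) {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                q[IDX(i, j, n)] = 4.0 * p[IDX(i, j, n)]
                                  - p[IDX(i-1, j, n)] - p[IDX(i+1, j, n)]
                                  - p[IDX(i, j-1, n)] - p[IDX(i, j+1, n)];
    }

    static double dot(const double *x, const double *y, int len) {
        double s = 0.0;
        for (int k = 0; k < len; k++) s += x[k] * y[k];
        return s;
    }

    /* Conjugate gradient for T*x = b, following the slide. */
    void cg_poisson(double *x, double *r, double *p, double *q,
                    const double *b, int n, int maxit) {
        int len = (n + 2) * (n + 2);
        poisson_matvec(x, q, n);                                      /* r = b - T*x      */
        for (int k = 0; k < len; k++) { r[k] = b[k] - q[k]; p[k] = 0.0; }
        double old_rho = 1.0;
        for (int j = 1; j <= maxit; j++) {
            double rho = dot(r, r, len);                              /* dot product      */
            if (sqrt(rho) < 1e-8) break;                              /* rho small enough */
            double beta = (j == 1) ? 0.0 : rho / old_rho;
            for (int k = 0; k < len; k++) p[k] = r[k] + beta * p[k];  /* saxpy            */
            poisson_matvec(p, q, n);                                  /* q = T*p (SpMV)   */
            double alpha = rho / dot(p, q, len);                      /* dot product      */
            for (int k = 0; k < len; k++) x[k] += alpha * p[k];       /* saxpy            */
            for (int k = 0; k < len; k++) r[k] -= alpha * q[k];       /* saxpy            */
            old_rho = rho;
        }
    }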
Optimizations of Structured Grid Operations
• Sparse matrix-vector multiply (aka nearest
neighbor on a grid) is a critical component
• For arbitrary grids and matrices, we will
talk about optimizations in a future lecture
• For structured grids we can look at
• Memory system performance
• Blocking/tiling as in matrix multiplication
• Parallel implementations and optimizations
Basic Stencil Code for a 3D 7-Point Stencil
void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  /* z (index k) is the unit-stride dimension; row-major layout A[i][j][k] */
  #define IDX3(i, j, k) ((i)*ny*nz + (j)*nz + (k))
  const double S0 = -6.0, S1 = 1.0;   /* illustrative weights (7-point Laplacian) */
  for (int i = 1; i < nx-1; i++) {         /* x-dim (skip boundary) */
    for (int j = 1; j < ny-1; j++) {       /* y-dim                 */
      for (int k = 1; k < nz-1; k++) {     /* z-dim, unit stride    */
        B[IDX3(i,j,k)] = S0 * A[IDX3(i,j,k)] +
                         S1 * (A[IDX3(i+1,j,k)] + A[IDX3(i-1,j,k)] +
                               A[IDX3(i,j+1,k)] + A[IDX3(i,j-1,k)] +
                               A[IDX3(i,j,k+1)] + A[IDX3(i,j,k-1)]);
      }
    }
  }
  #undef IDX3
}
[Figure: 3D grid with axes x, y, and z; z is the unit-stride dimension]
3D (7pt) Stencils Achieve a Low Fraction of Peak…
Even on a single processor, the naive stencil code achieves a low fraction of machine peak:
[Figure: % of machine peak for one iteration of the naive stencil code on Itanium2, Opteron, and Power5, for cubic grid dimensions 64, 128, 256, and 512]
Tiling Stencil Computations
° Several papers on tiling stencil computations
• E.g., Rivera and Tseng SC2002, …
° Old Conventional Wisdom
• Cache misses are the most important factor
[Figures: cache misses for blocked vs. unblocked code, and running time for unblocked vs. blocked code]
Stanza Triad Results
° Perfect prefetching:
• performance is independent of L, the stanza length
  • expect a flat line at STREAM peak
° Our results show performance depends on L
Cost Model For Memory Cost
° Goal: model cost of memory operations for Stencils
• understand effect of prefetch on performance
° Start with model for Stanza Triad
° Simplified (2-point) model of a prefetch engine
  • any non-consecutive cache line is not prefetched
  • any consecutive cache line is prefetched
  • in reality, need more than one cache miss to enable prefetch
  • may have more than 2 points (step function based on size)
° First cache line in every L-element stanza is not prefetched
• assign cost Cnon-prefetched
• get value from Stanza Triad with L=cache line size
° The rest of the cache lines are prefetched
• assign cost Cprefetched
• value from Stanza Triad with large L
° Total Cost:
Cost = #non-prefetched * Cnon-prefetched +
#prefetched * Cprefetched
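A small C sketch of this two-point cost model for an access pattern of contiguous L-element stanzas (c_nonprefetched and c_prefetched are the two costs measured with Stanza Triad; the cache-line parameter and names are assumptions for illustration):

    /* Estimated memory time for n_stanzas stanzas of L contiguous doubles each:
       the first cache line of every stanza pays the non-prefetched cost,
       the remaining lines pay the prefetched cost.                            */
    double stanza_cost(long n_stanzas, long L, long words_per_line,
                       double c_nonprefetched, double c_prefetched) {
        long lines_per_stanza = (L + words_per_line - 1) / words_per_line; /* ceil */
        long nonpref = n_stanzas;                              /* 1 per stanza     */
        long pref    = n_stanzas * (lines_per_stanza - 1);     /* the rest         */
        return nonpref * c_nonprefetched + pref * c_prefetched;
    }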
Stencil Cost Model
[Figure: the stencil sweep touches three adjacent grid planes: k-1, k, and k+1]
° Lower Bound
  • only load the entire grid once & store once
  • Memory traffic: 2 * N^3
° Upper Bound
  • only the k plane stays in cache, must reload the other 2
  • Memory traffic: 4 * N^3
° Model built using parameters measured in STriad
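For example, assuming 8-byte doubles on a 256^3 grid, the lower bound is 2 * 256^3 * 8 bytes ≈ 0.27 GB of memory traffic per sweep, and the upper bound is 4 * 256^3 * 8 bytes ≈ 0.54 GB.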
Stencil Probe Cost Model
[Figures: cost-model results for the stencil probe (plots spanning three slides)]
Stencil Cache Blocking Speedups
° New Conventional Wisdom: Prefetching is as important as caching
° Little’s Law (Bailey ’97): data in flight needed = latency * bandwidth
° Cache blocking is useful when
  1. grid sizes are large: 3 planes do not fit in cache for the 3D problem
  2. the unit-stride dimension is not cut/blocked
[Figure: best-case speedup from Rivera blocking (block sizes 256x32, 512x64, 512x16, 256x16) on Itanium (grids 128/256/512), Opteron (128/256/512), and G5 (128/256); y-axis: speedup, 0 to 1.8]
We Are Still far from FP Peak
But are much closer (even without blocking) to memory peak
This data only counts 1 load of each mesh point (infinitely large cache)
[Figure: lower bound on % of peak memory bandwidth for one iteration of the naive stencil code on Itanium2, Opteron, and Power5, for cubic grid dimensions 64, 128, 256, and 512]
Blocking Over Time / Iterations
° Can we do better than this?
• Code is still severely limited by memory bandwidth
° For some computations, you can merge across
k sweeps over the grid
• Re-use data k times (as well as re-use within a plane)
° Dependencies produce pyramid patterns
[Figure: space-time diagram, time from t0 to t1 and space from x0 to x1, with the pyramid/trapezoid regions bounded by slopes dx0 and dx1]
(Frigo & Strumpen, ICS05)
The Algorithm - Base Case
If the height is 1 (i.e., t1 - t0 = 1), then we simply have a line of points (t0, x) where x0 <= x <= x1. Apply the kernel to this set of points; the order does not matter (there are no interdependencies).
[Figure: a height-1 trapezoid: a single row of points at time t0 spanning x0 to x1]
The Algorithm - Space Cut
° If the width >= 2*height, then cut with slope = -1
through the center.
[Figure: the trapezoid (time t0 to t1, space x0 to x1) cut by a slope -1 line through its center into T1 (left piece) and T2 (right piece)]
° Do T1, then T2. No point in T1 depends on values
from T2.
The Algorithm - Time Cut
° Otherwise, cut trapezoid in half in the time
dimension.
[Figure: the trapezoid (time t0 to t1, space x0 to x1) cut in half along the time dimension into T1 (earlier half) and T2 (later half)]
° Do T1, then T2. No point in T1 depends on values of
T2.
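A compact C sketch of this 1D trapezoid recursion, following Frigo & Strumpen's cache-oblivious scheme (kernel(t, x) is a hypothetical per-point stencil update; the cut test and midpoint formula follow their paper, so treat the details as a sketch):

    void kernel(int t, int x);   /* placeholder: the per-point update at (t, x) */

    /* Trapezoid: for t0 <= t < t1, the points x0 + dx0*(t-t0) <= x < x1 + dx1*(t-t0). */
    void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1) {
        int dt = t1 - t0;
        if (dt == 1) {                            /* base case: one row of points  */
            for (int x = x0; x < x1; x++)
                kernel(t0, x);                    /* order within the row is free  */
        } else if (dt > 1) {
            if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
                /* space cut: trapezoid is wide, cut with slope -1 through center  */
                int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
                walk1(t0, t1, x0, dx0, xm, -1);   /* T1: no point here needs T2    */
                walk1(t0, t1, xm, -1, x1, dx1);   /* T2                            */
            } else {
                /* time cut: split the time range in half                          */
                int s = dt / 2;
                walk1(t0, t0 + s, x0, dx0, x1, dx1);                       /* T1   */
                walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);   /* T2   */
            }
        }
    }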
Performance Issues
° Stencils have simple inner loops
• Usually about 1 Flop/load
• Reuse is limited to the number of points in the stencil
° Performance usually limited by memory
• For large meshes (e.g., 512^3), cache blocking helps
• For smaller meshes, running time is roughly the time to read the
mesh once from main memory
• Tradeoff of blocking: reduces cache misses (bandwidth), but
increases prefetch misses (latency)
° Power5 has code generation problems
• Platform has aliasing issues
• Causes low percentage of both machine and bandwidth peak
Cache Blocking - Multiple Iterations
° Now we allow reuse across iterations
° Cache blocking now becomes slightly trickier
• Need to shift block after each iteration to respect dependencies
• Requires cache block dimension c as a parameter
• We call this “Cache Blocking with Time Skewing”
° Example: a simple 3-point 1D stencil with 4 cache blocks, shown above (figure omitted)
Cache Blocking with Time Skewing Performance
[Figure: GFlop rate for 4 iterations of a 256^3 problem, comparing naïve unaliased, naïve aliased, and best time skewing on Itanium 2, Opteron, and Power5]
° Itanium 2 and Opteron both improve at least 20%
° Power5 has aliasing issues
Time Skewing - Optimal Block Size Search
[Figures: exhaustive search over cache block sizes for time skewing, with the best ("GOOD") regions marked]
° Reduced memory traffic correlates well with higher GFlop rates
Performance Model
GOAL: Find optimal cache block size without exhaustive search
° Most important factors: memory traffic and prefetching
° First count the number of cache misses
• Inputs: cache size, cache line size, and grid size
• Model then classifies block sizes into 5 cases
• Misses are classified as either “fast” or “slow”
° Then predict performance by factoring in prefetching
• STriad microbenchmark determines cost of “fast” and “slow” misses
• Combine with cache miss model to compute running time
° This will eliminate poor block sizes, but does not choose best
• Best predicted block size may not actually be best
• Still need to do search over pruned parameter space
Memory Read Traffic Model
[Figure: modeled vs. measured memory read traffic across block sizes; the model overpredicts in some regions and underpredicts in others]
Performance Model
[Figure: modeled vs. measured performance across block sizes; the model overpredicts in some regions and underpredicts in others]
Parallelizing Jacobi’s Method
° Reduces to sparse-matrix-vector multiply by (nearly) T
     U(m+1) = (I - T/4) * U(m) + B/4
° Each value of U(m+1) may be updated independently
• keep 2 copies for iterations m and m+1
° Requires that boundary values be communicated
  • if each processor owns n^2/p elements to update
  • the amount of data communicated, about n/p^1/2 per neighbor for a square block decomposition, is relatively small if n >> p
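A hedged C/MPI sketch of the communication step for the simpler 1D (row-strip) decomposition: each process owns local_rows full rows plus two ghost rows, and exchanges boundary rows with its up/down neighbors before each Jacobi sweep (all names are illustrative):

    #include <mpi.h>

    /* u has (local_rows + 2) rows of (n + 2) doubles each; row 0 and row
       local_rows+1 are ghost rows, rows 1..local_rows are owned.            */
    void exchange_ghosts(double *u, int local_rows, int n, MPI_Comm comm) {
        int rank, p, row = n + 2;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send first owned row up, receive the lower ghost row from below */
        MPI_Sendrecv(&u[1 * row],                row, MPI_DOUBLE, up,   0,
                     &u[(local_rows + 1) * row], row, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send last owned row down, receive the upper ghost row from above */
        MPI_Sendrecv(&u[local_rows * row],       row, MPI_DOUBLE, down, 1,
                     &u[0],                      row, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }

Each process then runs an ordinary Jacobi sweep on its owned rows. For this row-strip version every exchange moves about n values per neighbor; a square-block decomposition reduces that to about n/p^1/2, as noted above.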
Locality Optimizations in Jacobi
° Nearest neighbor update in Jacobi has:
• Good spatial locality (for regular mesh): traverse rows/columns
• Poor temporal locality: few flops per mesh point
- E.g., on 2D mesh: 4 adds and 1 multiply for 5 loads and 1 store
° For both parallel and single processors, may trade
off extra computation for reduced memory
operations
• Idea: for each subgrid, do multiple Jacobi iterations
• Size of subgrid shrinks on each iteration, so start with larger one
• Used for uniprocessors:
- Rescheduling for Locality, Michelle Mills Strout:
– www-unix.mcs.anl.gov/~mstrout
- Cache-efficient multigrid algorithms, Sid Chatterjee
• And parallel machines, including “tuning” for grids
- Ian Foster et al, SC2001
Redundant Ghost Nodes in Jacobi
° Overview of Memory Hierarchy Optimization
[Figure: to compute the green region, copy the yellow region (ghost nodes) from neighbors and compute the blue region]
° Can be used on unstructured meshes
° Size of ghost region (and redundant computation)
depends on network/memory speed vs. computation
Extra Slides
Templates for choosing an iterative algorithm for Ax=b
° Use only matrix-vector-multiplication, dot, saxpy, …
° www.netlib.org/templates
Symmetric?
  • No → A^T available?
      - No → Try CGS or Bi-CGSTAB
      - Yes → Storage expensive?
          • Yes → Try QMR
          • No → Try GMRES
  • Yes → Definite?
      - No → Try MINRES or CG
      - Yes → Eigenvalues known?
          • No → Try CG
          • Yes → Try CG or Chebyshev
Summary of Jacobi, SOR and CG
° Jacobi, SOR, and CG all perform sparse-matrix-vector
multiply
° For Poisson, this means nearest neighbor
communication on an n-by-n grid
° It takes n = N^1/2 steps for information to travel across an n-by-n grid
° Since the solution on one side of the grid depends on data on the other side, faster methods require faster ways to move information:
• FFT
• Multigrid
Solving the Poisson equation with the FFT
° Motivation: express the continuous solution as a Fourier series
  • u(x,y) = Σi Σk u_ik sin(π i x) sin(π k y)
  • u_ik is called the Fourier coefficient of u(x,y)
° Poisson’s equation ∂²u/∂x² + ∂²u/∂y² = b becomes
     Σi Σk (-π²i² - π²k²) u_ik sin(π i x) sin(π k y) = Σi Σk b_ik sin(π i x) sin(π k y)
  where b_ik are the Fourier coefficients of b(x,y)
° By uniqueness of the Fourier series, u_ik = b_ik / (-π²i² - π²k²)
° Continuous Algorithm (Discrete Algorithm):
  • Compute the Fourier coefficients b_ik of the right hand side
    (apply a 2D FFT to the values of b(i,k) on the grid)
  • Compute the Fourier coefficients u_ik of the solution
    (divide each transformed b(i,k) by function(i,k))
  • Compute the solution u(x,y) from its Fourier coefficients
    (apply a 2D inverse FFT)
Serial FFT
° Let i = sqrt(-1) and index matrices and vectors from 0.
° The Discrete Fourier Transform of an m-element vector v is:
     F*v
  where F is the m-by-m matrix defined as:
     F[j,k] = ω^(j*k)
  and ω is:
     ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
° ω is a complex number whose mth power ω^m = 1, and is therefore called an mth root of unity
° E.g., for m = 4:
     ω = i,  ω² = -1,  ω³ = -i,  ω⁴ = 1
Using the 1D FFT for filtering
° Signal = sin(7t) + .5 sin(5t) at 128 points
° Noise = random number bounded by .75
° Filter by zeroing out FFT components < .25
Using the 2D FFT for image compression
° Image = 200x320 matrix of values
° Compress by keeping largest 2.5% of FFT
components
° Similar idea used by jpeg
Related Transforms
° Most applications require multiplication by both F
and inverse(F).
° Multiplying by F and inverse(F) are essentially the
same. (inverse(F) is the complex conjugate of F
divided by m.)
° For solving the Poisson equation and various other
applications, we use variations on the FFT
• The sin transform -- imaginary part of F
• The cos transform -- real part of F
° Algorithms are similar, so we will focus on the
forward FFT.
Serial Algorithm for the FFT
° Compute the FFT of an m-element vector v, F*v:
     (F*v)[j] = Σ_{k=0..m-1} F(j,k) * v(k)
              = Σ_{k=0..m-1} ω^(j*k) * v(k)
              = Σ_{k=0..m-1} (ω^j)^k * v(k)
              = V(ω^j)
° where V is defined as the polynomial
     V(x) = Σ_{k=0..m-1} x^k * v(k)
Divide and Conquer FFT
° V can be evaluated using divide-and-conquer
     V(x) = Σ_{k=0..m-1} x^k * v(k)
          = v[0] + x²*v[2] + x⁴*v[4] + …
            + x*(v[1] + x²*v[3] + x⁴*v[5] + …)
          = V_even(x²) + x*V_odd(x²)
° V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1
° We evaluate these at the points (ω^j)² for 0 <= j <= m-1
° But these are really just m/2 different points, since
     (ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
° So the FFT on m points is reduced to 2 FFTs on m/2 points
  • Divide and conquer!
Divide-and-Conquer FFT
FFT(v, ω, m)
   if m = 1 return v[0]
   else
      v_even = FFT(v[0:2:m-2], ω², m/2)
      v_odd  = FFT(v[1:2:m-1], ω², m/2)
      ω_vec = [ω⁰, ω¹, …, ω^(m/2-1)]           … precomputed
      return [ v_even + (ω_vec .* v_odd),
               v_even - (ω_vec .* v_odd) ]
° The .* above is component-wise multiplication.
° The [ …, … ] constructs an m-element vector from two m/2-element vectors.
This results in an O(m log m) algorithm.
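A runnable C99 sketch of this recursion, using the slides’ convention ω = e^(2πi/m) (the scratch-buffer scheme and the function name are illustrative; m must be a power of 2):

    #include <complex.h>
    #include <math.h>

    /* Recursive radix-2 FFT: on return v[j] = Σ_k ω^(j*k) * v_in[k].
       scratch must have room for m complex numbers.                  */
    void fft_rec(double complex *v, int m, double complex *scratch) {
        if (m == 1) return;
        double complex *ve = scratch, *vo = scratch + m / 2;
        for (int k = 0; k < m / 2; k++) {          /* split into even and odd parts */
            ve[k] = v[2 * k];
            vo[k] = v[2 * k + 1];
        }
        fft_rec(ve, m / 2, v);                     /* v can be reused as scratch    */
        fft_rec(vo, m / 2, v);
        for (int j = 0; j < m / 2; j++) {
            double complex w = cexp(2.0 * acos(-1.0) * I * j / m);   /* twiddle ω^j */
            v[j]         = ve[j] + w * vo[j];
            v[j + m / 2] = ve[j] - w * vo[j];
        }
    }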
An Iterative Algorithm
° The call tree of the d&c FFT algorithm is a complete
binary tree of log m levels
FFT(0,1,2,3,…,15) = FFT(xxxx)
  ├─ even: FFT(0,2,…,14) = FFT(xxx0)
  │          ├─ FFT(xx00):  FFT(x000)  FFT(x100)  →  FFT(0)  FFT(8)  FFT(4)  FFT(12)
  │          └─ FFT(xx10):  FFT(x010)  FFT(x110)  →  FFT(2)  FFT(10) FFT(6)  FFT(14)
  └─ odd:  FFT(1,3,…,15) = FFT(xxx1)
             ├─ FFT(xx01):  FFT(x001)  FFT(x101)  →  FFT(1)  FFT(9)  FFT(5)  FFT(13)
             └─ FFT(xx11):  FFT(x011)  FFT(x111)  →  FFT(3)  FFT(11) FFT(7)  FFT(15)
° Practical algorithms are iterative, going across each
level in the tree starting at the bottom
° Algorithm overwrites v[i] by (F*v)[bitreverse(i)]
Parallel 1D FFT
° Data dependencies in
1D FFT
• Butterfly pattern
° A PRAM algorithm
takes O(log m) time
• each step to right is
parallel
• there are log m steps
° What about
communication cost?
° See LogP paper for
details
Block Layout of 1D FFT
° Using a block layout
(m/p contiguous elts per
processor)
° No communication in
last log m/p steps
° Each step requires fine-grained communication
in first log p steps
Cyclic Layout of 1D FFT
° Cyclic layout (only 1
element per
processor, wrapped)
° No communication
in first log(m/p)
steps
° Communication in
last log(p) steps
Parallel Complexity
° m = vector size, p = number of processors
° f = time per flop = 1
° α = startup cost of a message (in units of f)
° β = time per word in a message (in units of f)
° Time(blockFFT) = Time(cyclicFFT) =
       2*m*log(m)/p
     + log(p) * α
     + m*log(p)/p * β
FFT With “Transpose”
° If we start with a cyclic
layout for first log(p)
steps, there is no
communication
° Then transpose the
vector for last log(m/p)
steps
° All communication is
in the transpose
Why is the Communication Step Called a Transpose?
° Analogous to transposing an array
° View as a 2D array of n/p by p
° Note: same idea is useful for uniprocessor caches
Complexity of the FFT with Transpose
° If no communication is overlapped (overestimate!)
° Time(transposeFFT) =
       2*m*log(m)/p          … same as before
     + (p-1) * α             … was log(p) * α
     + m*(p-1)/p² * β        … was m*log(p)/p * β
° If communication is overlapped, so we do not pay
for p-1 messages, the second term becomes simply
α, rather than (p-1)*α.
° This is close to optimal. See LogP paper for details.
° See also following papers on class resource page
• A. Sahai, “Hiding Communication Costs in Bandwidth Limited FFT”
• R. Nishtala et al, “Optimizing bandwidth limited problems using one-sided communication”
Comment on the 1D Parallel FFT
° The above algorithm leaves data in bit-reversed order
• Some applications can use it this way, like Poisson
• Others require another transpose-like operation
° Other parallel algorithms also exist
• A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
• Based on the Fast Multipole algorithm
• Less communication for non-bit-reversed algorithm
Higher Dimension FFTs
° FFTs in 2 or 3 dimensions are defined as 1D FFTs on
vectors in all dimensions.
° E.g., a 2D FFT does 1D FFTs on all rows and then all
columns
° There are 3 obvious possibilities for the 2D FFT:
• (1) 2D blocked layout for matrix, using 1D algorithms for each row
and column
• (2) Block row layout for matrix, using serial 1D FFTs on rows,
followed by a transpose, then more serial 1D FFTs
• (3) Block row layout for matrix, using serial 1D FFTs on rows,
followed by parallel 1D FFTs on columns
• Option 2 is best, if we overlap communication and computation
° For a 3D FFT the options are similar
• 2 phases done with serial FFTs, followed by a transpose for 3rd
• can overlap communication with 2nd phase in practice
FFTW – Fastest Fourier Transform in the West
° www.fftw.org
° Produces FFT implementation optimized for
  • Your version of the FFT (complex, real, …)
  • Your value of n (arbitrary, possibly prime)
  • Your architecture
° Close to optimal for serial, can be improved for parallel
° Similar in spirit to PHIPAC/ATLAS/Sparsity
° Won 1999 Wilkinson Prize for Numerical Software
° Widely used
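A minimal FFTW3 usage sketch for a 1D complex transform (compile with -lfftw3; error handling omitted; note that FFTW’s forward transform uses e^(-2πijk/n), the opposite sign convention from the ω on the earlier slides):

    #include <fftw3.h>

    void fft_with_fftw(int n) {
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Plan once (FFTW searches for a good algorithm), then execute many times. */
        fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

        /* ... fill in[] with data (in[k][0] = real part, in[k][1] = imaginary) ... */
        fftw_execute(plan);            /* out[] now holds the transform */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
    }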
Extra Slides
Poisson’s equation arises in many models
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or Gravitational Potential:
Potential(position)
° Fluid flow: Velocity,Pressure,Density(position,time)
° Quantum mechanics: Wave-function(position,time)
° Elasticity: Stress,Strain(position,time)
Poisson’s equation in 1D
T is the tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals:

      [  2   -1                 ]
      [ -1    2   -1            ]
T  =  [      -1    2   -1       ]
      [           -1    2   -1  ]
      [                -1    2  ]

Graph and “stencil”: [Figure: 1D chain of grid points with the 3-point stencil (-1, 2, -1)]
Algorithms for 2D Poisson Equation with N unknowns
Algorithm         Serial        PRAM            Memory      #Procs
Dense LU          N^3           N               N^2         N^2
Band LU           N^2           N               N^3/2       N
Jacobi            N^2           N               N           N
Explicit Inv.     N^2           log N           N^2         N^2
Conj. Grad.       N^3/2         N^1/2 * log N   N           N
RB SOR            N^3/2         N^1/2           N           N
Sparse LU         N^3/2         N^1/2           N*log N     N
FFT               N*log N       log N           N           N
Multigrid         N             log^2 N         N           N
Lower bound       N             log N           N
PRAM is an idealized parallel model with zero cost communication
Short explanations of algorithms on previous slide
° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most specialized (works on matrices “like” Poisson)
° Dense LU: Gaussian elimination; works on any N-by-N matrix
° Band LU: exploit fact that T is nonzero only on sqrt(N) diagonals nearest main
diagonal, so faster
° Jacobi: essentially does matrix-vector multiply by T in inner loop of iterative
algorithm
° Explicit Inverse: assume we want to solve many systems with T, so we can
precompute and store inv(T) “for free”, and just multiply by it
• It’s still expensive!
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but
exploits mathematical properties of T that Jacobi does not
° Red-Black SOR (Successive Overrelaxation): Variation of Jacobi that exploits
yet different mathematical properties of T
• Used in Multigrid
° Sparse LU: Gaussian elimination exploiting particular zero structure of T
° FFT (Fast Fourier Transform): works only on matrices very like T
° Multigrid: also works on matrices like T, that come from elliptic PDEs
° Lower Bound: serial (time to print answer); parallel (time to combine N inputs)
° Details in class notes and www.cs.berkeley.edu/~demmel/ma221
Irregular mesh: Tapered Tube (multigrid)
Composite mesh from a mechanical structure