Parallel Spectral Methods: Solving Elliptic Problems with FFTs
Horst Simon
[email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr09
CS267 Lecture 20, 04/15/2009
Motifs
° The Motifs (formerly "Dwarfs") from "The Berkeley View" (Asanovic et al.)
° Motifs form key computational patterns
° The motif covered in this lecture is spectral methods (FFTs)

References
° Previous CS267 lectures
° Lecture by Geoffrey Fox: http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt
° FFTW project: http://www.fftw.org
° Spiral project: http://www.spiral.net

Poisson's Equation Arises in Many Models
° 3D: ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = f(x,y,z)
° 2D: ∂²u/∂x² + ∂²u/∂y² = f(x,y)
° 1D: d²u/dx² = f(x)
° f represents the sources; boundary conditions are also needed
° Electrostatic or gravitational potential: potential(position)
° Heat flow: temperature(position, time)
° Diffusion: concentration(position, time)
° Fluid flow: velocity, pressure, density(position, time)
° Elasticity: stress, strain(position, time)
° Variations of Poisson have variable coefficients

Algorithms for the 2D (3D) Poisson Equation (N = n^2 (n^3) variables)
    Algorithm        Serial              PRAM                       Memory              #Procs
    Dense LU         N^3                 N                          N^2                 N^2
    Band LU          N^2 (N^(7/3))       N                          N^(3/2) (N^(5/3))   N (N^(4/3))
    Jacobi           N^2 (N^(5/3))       N (N^(2/3))                N                   N
    Explicit Inv.    N^2                 log N                      N^2                 N^2
    Conj. Gradients  N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3)) · log N  N                   N
    Red/Black SOR    N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3))          N                   N
    Sparse LU        N^(3/2) (N^2)       N^(1/2)                    N·log N (N^(4/3))   N
    FFT              N·log N             log N                      N                   N
    Multigrid        N                   log^2 N                    N                   N
    Lower bound      N                   log N                      N
° PRAM is an idealized parallel model with zero-cost communication
° Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997

Solving Poisson's Equation with the FFT
° Express any 2D function defined on 0 ≤ x,y ≤ 1 as a series: φ(x,y) = Σj Σk φjk sin(πjx) sin(πky)
  • Here the φjk are called the Fourier coefficients of φ(x,y)
° The inverse of this is: φjk = 4 ∫₀¹ dx ∫₀¹ dy φ(x,y) sin(πjx) sin(πky)
° Poisson's equation ∂²φ/∂x² + ∂²φ/∂y² = f(x,y) becomes
    Σj Σk (-π²j² - π²k²) φjk sin(πjx) sin(πky) = Σj Σk fjk sin(πjx) sin(πky)
° where the fjk are the Fourier coefficients of f(x,y), i.e. f(x,y) = Σj Σk fjk sin(πjx) sin(πky)
° This implies the PDE can be solved exactly and algebraically: φjk = fjk / (-π²j² - π²k²)

Solving Poisson's Equation with the FFT (continued)
° So the solution of Poisson's equation involves the following steps (a small numerical sketch follows this slide):
° 1) Find the Fourier coefficients fjk of f(x,y) by performing the integral
° 2) Form the Fourier coefficients of φ by φjk = fjk / (-π²j² - π²k²)
° 3) Construct the solution φ(x,y) by performing the sum
° There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points rather than with the continuous integral
  • Also, the simplest (mathematically) transform uses exp(-2πijx) rather than sin(πjx)
  • Let us first consider the 1D discrete version of this case
  • The PDE case normally deals with discretized functions, since these are needed for other parts of the problem
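To make the three steps concrete, here is a minimal, self-contained sketch in C that solves the 1D model problem d²u/dx² = f(x) with zero boundary values on a grid of N interior points. It uses a naive O(N²) discrete sine transform for clarity and the continuous eigenvalues -π²k²; a production code would instead call an FFT-based sine transform (e.g., FFTW's FFTW_RODFT00). The grid size and the sample right-hand side are illustrative choices, not taken from the lecture.

    #include <math.h>
    #include <stdio.h>

    #define N 64   /* number of interior grid points (illustrative) */

    /* Naive discrete sine transform: c[k] = sum_{j=1..N} a[j]*sin(pi*j*k/(N+1)).
       Applying it twice multiplies the data by (N+1)/2, which we undo below. */
    static void dst(const double *a, double *c) {
        for (int k = 1; k <= N; k++) {
            double s = 0.0;
            for (int j = 1; j <= N; j++)
                s += a[j] * sin(M_PI * j * k / (N + 1));
            c[k] = s;
        }
    }

    int main(void) {
        double f[N + 2], fhat[N + 2], uhat[N + 2], u[N + 2];
        double h = 1.0 / (N + 1);

        /* Sample right-hand side f(x) = -pi^2 sin(pi x), whose solution is u(x) = sin(pi x). */
        for (int j = 1; j <= N; j++)
            f[j] = -M_PI * M_PI * sin(M_PI * j * h);

        dst(f, fhat);                                    /* step 1: sine coefficients of f   */
        for (int k = 1; k <= N; k++)
            uhat[k] = fhat[k] / (-M_PI * M_PI * k * k);  /* step 2: divide by the eigenvalue */
        dst(uhat, u);                                    /* step 3: sum the sine series      */
        for (int j = 1; j <= N; j++)
            u[j] *= 2.0 / (N + 1);                       /* undo the transform normalization */

        int mid = N / 2;
        printf("u(%.4f) = %.6f   exact %.6f\n", mid * h, u[mid], sin(M_PI * mid * h));
        return 0;
    }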
Serial FFT
° Let i = sqrt(-1) and index matrices and vectors from 0.
° The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined by F[j,k] = ω^(j*k), with
    ω = e^(2πi/m) = cos(2π/m) + i·sin(2π/m)
° ω is a complex number whose mth power is ω^m = 1, and it is therefore called an mth root of unity
° E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1

Using the 1D FFT for Filtering
° Signal = sin(7t) + 0.5·sin(5t) at 128 points
° Noise = random numbers bounded by 0.75
° Filter by zeroing out FFT components < 0.25

Using the 2D FFT for Image Compression
° Image = 200 x 320 matrix of values
° Compress by keeping the largest 2.5% of the FFT components
° A similar idea is used by JPEG

Related Transforms
° Most applications require multiplication by both F and inverse(F).
° Multiplying by F and by inverse(F) are essentially the same operation (inverse(F) is the complex conjugate of F divided by m).
° For solving the Poisson equation and various other applications, we use variations on the FFT:
  • The sine transform – the imaginary part of F
  • The cosine transform – the real part of F
° The algorithms are similar, so we will focus on the forward FFT.

Serial Algorithm for the FFT
° Compute the FFT of an m-element vector v, F*v:
    (F*v)[j] = Σ_{k=0}^{m-1} F(j,k) * v(k)
             = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
             = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
             = V(ω^j)
° where V is the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)

Divide and Conquer FFT
° V can be evaluated using divide-and-conquer:
    V(x) = Σ_{k=0}^{m-1} x^k * v(k)
         = v[0] + x²·v[2] + x⁴·v[4] + … + x·(v[1] + x²·v[3] + x⁴·v[5] + …)
         = Veven(x²) + x·Vodd(x²)
° V has degree m-1, so Veven and Vodd are polynomials of degree m/2 - 1
° We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1
° But this is really just m/2 different points, since (ω^(j+m/2))² = (ω^j · ω^(m/2))² = ω^(2j) · ω^m = (ω^j)²
° So an FFT on m points is reduced to 2 FFTs on m/2 points
  • Divide and conquer!

Divide-and-Conquer FFT
    FFT(v, ω, m)
       if m = 1 return v[0]
       else
          veven = FFT(v[0:2:m-2], ω², m/2)
          vodd  = FFT(v[1:2:m-1], ω², m/2)
          ω-vec = [ω⁰, ω¹, …, ω^(m/2-1)]            … precomputed
          return [veven + (ω-vec .* vodd), veven - (ω-vec .* vodd)]
° The .* above is component-wise multiplication.
° The […,…] constructs an m-element vector from two m/2-element vectors.
° This results in an O(m log m) algorithm (a runnable C version follows this slide).
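The pseudocode above translates almost directly into a working routine. Below is a minimal recursive sketch in C using complex doubles; it assumes m is a power of two and allocates scratch arrays at every level for clarity, so it illustrates the algorithm rather than being an efficient implementation.

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Recursive divide-and-conquer FFT: y[j] = sum_k omega^(j*k) * v[k], m a power of 2. */
    static void fft(const double complex *v, double complex omega, int m, double complex *y) {
        if (m == 1) { y[0] = v[0]; return; }
        int h = m / 2;
        double complex *even = malloc(h * sizeof *even), *odd = malloc(h * sizeof *odd);
        double complex *ye   = malloc(h * sizeof *ye),   *yo  = malloc(h * sizeof *yo);
        for (int k = 0; k < h; k++) { even[k] = v[2 * k]; odd[k] = v[2 * k + 1]; }
        fft(even, omega * omega, h, ye);      /* FFT of the even-indexed entries */
        fft(odd,  omega * omega, h, yo);      /* FFT of the odd-indexed entries  */
        double complex w = 1.0;
        for (int j = 0; j < h; j++) {         /* combine: the butterfly step     */
            y[j]     = ye[j] + w * yo[j];
            y[j + h] = ye[j] - w * yo[j];
            w *= omega;
        }
        free(even); free(odd); free(ye); free(yo);
    }

    int main(void) {
        enum { M = 8 };
        double complex v[M], y[M];
        for (int k = 0; k < M; k++) v[k] = k;        /* sample input 0, 1, ..., 7 */
        fft(v, cexp(2 * M_PI * I / M), M, y);        /* omega = e^(2*pi*i/M)      */
        for (int j = 0; j < M; j++)
            printf("y[%d] = %7.3f %+7.3fi\n", j, creal(y[j]), cimag(y[j]));
        return 0;
    }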
An Iterative Algorithm
° The call tree of the divide-and-conquer FFT algorithm is a complete binary tree with log m levels
  [Call-tree diagram for m = 16: FFT(0,1,2,…,15) = FFT(xxxx) splits into the even indices FFT(0,2,…,14) = FFT(xxx0) and the odd indices FFT(1,3,…,15) = FFT(xxx1), then into FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11), and so on down to the single-element leaves FFT(0), FFT(8), FFT(4), FFT(12), …, FFT(7), FFT(15).]
° An iterative algorithm uses loops rather than recursion and sweeps each level of the tree, starting at the bottom
  • The algorithm overwrites v[i] by (F*v)[bitreverse(i)]
° Practical algorithms combine recursion (for the memory hierarchy) and iteration (to avoid function-call overhead)

Parallel 1D FFT
° Data dependencies in the 1D FFT
  • Butterfly pattern
° A PRAM algorithm takes O(log m) time
  • each step to the right is parallel
  • there are log m steps
° What about communication cost?
  • See the LogP paper for details

Block Layout of the 1D FFT
° Use a block layout (m/p contiguous elements per processor)
° No communication in the last log(m/p) steps
° Each of the first log(p) steps requires fine-grained communication

Cyclic Layout of the 1D FFT
° Use a cyclic layout (elements dealt out to processors one at a time, wrapping around)
° No communication in the first log(m/p) steps
° Communication in the last log(p) steps

Parallel Complexity
° m = vector size, p = number of processors
° f = time per flop = 1
° α = startup cost for a message (in units of f)
° β = time per word in a message (in units of f)
° Time(blockFFT) = Time(cyclicFFT) = 2·m·log(m)/p + log(p)·α + m·log(p)/p · β

FFT With "Transpose"
° If we start with a cyclic layout, there is no communication in the first log(p) steps (recall log(m/p) = log(p) in this example)
° Then transpose the vector for the last log(m/p) steps
° All communication is in the transpose
° Note: this example has log(m/p) = log(p)
  • If log(m/p) > log(p), more phases/layouts will be needed
  • We will work with this assumption for simplicity

Why is the Communication Step Called a Transpose?
° It is analogous to transposing an array
° View the vector as a 2D array of m/p by p elements
° Note: the same idea is useful for uniprocessor caches

Complexity of the FFT with Transpose
° If no communication is pipelined (an overestimate!):
    Time(transposeFFT) = 2·m·log(m)/p        … same as before
                       + (p-1) · α           … was log(p) · α
                       + m·(p-1)/p² · β      … was m·log(p)/p · β
° If communication is pipelined, so that we do not pay for p-1 separate messages, the second term becomes simply α rather than (p-1)·α
° This is close to optimal; see the LogP paper for details
° See also the following papers on the class resource page:
  • A. Sahai, "Hiding Communication Costs in Bandwidth Limited FFT"
  • R. Nishtala et al., "Optimizing bandwidth limited problems using one-sided communication"
° (A minimal MPI sketch of the transpose/exchange step follows this slide.)
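Viewed as a p-by-(m/p) array with one block row per processor, the redistribution is exactly an all-to-all exchange. The sketch below, in C with MPI, is a minimal illustration under that view; it assumes p² divides m, ignores the local element reordering a real FFT would also perform, and is not code from the lecture.

    #include <mpi.h>
    #include <stdlib.h>

    /* Transpose step of a distributed 1D FFT: each of the p processes owns m/p
       elements and sends a 1/p share of them to every other process. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long m = 1L << 20;        /* total vector length (assume p*p divides m) */
        long local = m / p;       /* elements owned by this process             */
        long share = local / p;   /* elements destined for each process         */

        double *v  = malloc(local * sizeof *v);
        double *vt = malloc(local * sizeof *vt);
        for (long i = 0; i < local; i++) v[i] = rank * local + i;   /* stand-in data */

        /* The global exchange: block j of my data goes to process j; afterwards
           the remaining log(p) butterfly steps are purely local. */
        MPI_Alltoall(v, (int)share, MPI_DOUBLE, vt, (int)share, MPI_DOUBLE, MPI_COMM_WORLD);

        free(v); free(vt);
        MPI_Finalize();
        return 0;
    }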
Comment on the 1D Parallel FFT
° The above algorithm leaves the data in bit-reversed order
  • Some applications can use it this way, like Poisson
  • Others require another transpose-like operation
° Other parallel algorithms also exist
  • A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
  • Based on the Fast Multipole algorithm
  • Less communication for a non-bit-reversed result

Higher-Dimensional FFTs
° FFTs in 2 or 3 dimensions are defined as 1D FFTs on the vectors along each dimension.
° E.g., a 2D FFT does 1D FFTs on all rows and then on all columns
° There are 3 obvious possibilities for the 2D FFT:
  • (1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
  • (2) Block-row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
  • (3) Block-row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
  • Option 2 is best, if we overlap communication and computation
° For a 3D FFT the options are similar
  • 2 phases are done with serial FFTs, followed by a transpose for the 3rd
  • in practice, communication can be overlapped with the 2nd phase

FFTW – Fastest Fourier Transform in the West
° www.fftw.org
° Produces an FFT implementation optimized for:
  • your flavor of FFT (complex, real, …)
  • your value of n (arbitrary, possibly prime)
  • your architecture
  • close to optimal for serial; can be improved for parallel
° Similar in spirit to PHiPAC/ATLAS/Sparsity
° Won the 1999 Wilkinson Prize for Numerical Software
° Widely used for serial FFTs
  • Had parallel FFTs in version 2, but they are no longer supported
  • Layout constraints from users/apps plus network differences are hard to support

Bisection Bandwidth
° The FFT requires one (or more) transpose operations:
  • Every processor sends 1/p of its data to each other processor
° Bisection bandwidth limits this performance
  • Bisection bandwidth is the bandwidth across the narrowest part of the network
  • Important in global transpose operations, all-to-all, etc.
° "Full bisection bandwidth" is expensive
  • The fraction of machine cost spent on the network is increasing
  • Fat-tree and full-crossbar topologies may be too expensive
  • Especially on machines with 100K and more processors
  • SMP clusters often limit bandwidth at the node level

Modified LogGP Model
  [Diagram: two message timelines from P0 to P1 showing send overhead (osend), network latency (L), receive overhead (orecv), and the gap g; in the basic LogGP model there is no overlap.]
° EEL: end-to-end latency (1/2 round trip)
° g: minimum time between small message sends
° G: additional gap per byte for larger messages

Historical Perspective
  [Bar chart: ½ round-trip latency in μs (roughly 0–25), broken into added latency, send overhead (alone), send & receive overhead, and receive overhead (alone), for Quadrics/Shmem, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI, IBM/LAPI, IBM/MPI, T3E/Shmem, T3E/E-Reg, and T3E/MPI.]
° Potential performance advantage for fine-grained, one-sided programs
° Potential productivity advantage for irregular applications

General Observations
° The overlap potential is the difference between the gap and the overhead
  • No potential if the CPU is tied up throughout the message send (e.g., no send-side DMA)
  • Grows with message size on machines with DMA, because the per-byte cost is handled by the NIC
  • Grows with the amount of network congestion, because the gap grows as the network becomes saturated
° Remote overhead is 0 on machines with RDMA

GASNet Communications System
° GASNet offers put/get communication
° One-sided: no remote CPU involvement required in the API (a key difference from MPI)
  • The message contains the remote address
  • No need to match with a receive
  • No implicit ordering required
° Used in language runtimes (UPC, etc.)
° Fine-grained and bulk transfers
° Split-phase communication
  [Diagram: software stack — compiler-generated code, language-specific runtime, GASNet, network hardware.]
° (A small one-sided put sketch, using MPI's RMA interface as a stand-in, follows this slide.)
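To illustrate the one-sided style in code: the sender supplies the remote address, no matching receive is posted, and completion is deferred to a later synchronization point (the split-phase idea). This sketch uses MPI's RMA interface purely as a stand-in, since GASNet's own API is not shown in the lecture; run it with at least two processes.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process exposes a window of 1024 doubles for remote access. */
        double *buf;
        MPI_Win win;
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

        MPI_Win_fence(0, win);                 /* open an access epoch */
        if (rank == 0) {
            double x = 3.14;
            /* One-sided put: write x into process 1's window at offset 0.
               No receive is posted on process 1; its CPU is not involved. */
            MPI_Put(&x, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            /* ... overlap: do local computation here while the put is in flight ... */
        }
        MPI_Win_fence(0, win);                 /* split-phase: completion happens here */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }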
Performance of One-Sided vs. Two-Sided Communication: GASNet vs. MPI
  [Plot: flood bandwidth (KB/s) vs. message size (bytes) for gasnet_put_nbi_bulk and MPI, with an inset of the relative bandwidth (put_nbi_bulk / MPI flood) ranging from 1.0 up to about 2.4; up is good.]
° Comparison on Opteron/InfiniBand – GASNet's vapi-conduit and OSU MPI 0.9.5
° Up to large message sizes (> 256 KB), GASNet provides up to a 2.2x improvement in streaming bandwidth
° The half-power point (N/2) differs by an order of magnitude

GASNet: Performance for Mid-Range Message Sizes
  [Bar chart: flood bandwidth for 4 KB messages as a percentage of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed, and Altix.]
° GASNet usually reaches saturation bandwidth before MPI – fewer costs to amortize
° GASNet usually outperforms MPI at medium message sizes – often by a large margin

NAS FT Case Study
° Performance of the Exchange (all-to-all) is critical
  • The communication-to-computation ratio increases with faster, more optimized 1D FFTs
  • Determined by the available bisection bandwidth
  • Between 30% and 40% of the application's total runtime
° Two ways to reduce the Exchange cost:
  1. Use a better network (higher bisection bandwidth)
  2. Overlap the all-to-all with computation (where possible) – "break up" the exchange
  • The default NAS FT Fortran/MPI code relies on #1; our approach uses UPC/GASNet and builds on #2
° Started as a CS267 project
  • The 1D partition of the 3D grid is a limitation
  • At most N processors for an N^3 grid
  • The HPC Challenge benchmark has a large 1D FFT (which can be viewed as 3D or more with the proper roots of unity)

3D FFT Operation with Global Exchange
  [Diagram: 1D FFTs on columns; transpose + 1D FFTs on rows; the Exchange (all-to-all) sends cachelines to threads 0, 1, 2, …; rows are divided among threads; the last 1D FFT is shown from thread 0's view.]
° A single communication operation (global exchange) sends THREADS large messages
° Separate computation and communication phases

Communication Strategies for the 3D FFT
° Three approaches:
  • Chunk (all rows with the same destination):
    - Wait for the 2nd-dimension FFTs to finish
    - Minimizes the number of messages
  • Slab (all rows in a single plane with the same destination):
    - Wait for the chunk of rows destined for one processor to finish
    - Overlap with computation
  • Pencil (a single row):
    - Send each row as it completes
    - Maximizes overlap and matches the natural layout
° Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Decomposing the NAS FT Exchange into Smaller Messages
° Three approaches:
  • Chunk: wait for the 2nd-dimension FFTs to finish
  • Slab: wait for the chunk of rows destined for one processor to finish
  • Pencil: send each row as it completes
° Example message-size breakdown for Class D (2048 x 1024 x 1024) at 256 processors:
  • Exchange (default): 512 KB
  • Slabs (sets of contiguous rows): 65 KB
  • Pencils (single rows): 16 KB

Overlapping Communication
° Goal: make use of "all the wires"
  • Distributed-memory machines allow for asynchronous communication
  • The Berkeley non-blocking extensions expose GASNet's non-blocking operations
° Approach: break up the all-to-all communication (a minimal sketch follows this slide)
  • Interleave row computations and row communications, since the 1D FFT is independent across rows
  • The decomposition can be into slabs (contiguous sets of rows) or pencils (individual rows)
  • Pencils allow:
    - An earlier start for the communication "phase" and improved local cache use
    - But more, smaller messages (same total volume)
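A minimal sketch of the interleaving idea, again in C with non-blocking MPI point-to-point calls as a stand-in for the UPC/GASNet extensions used in the lecture: as soon as the 1D FFTs for the rows bound for one destination (a slab) finish, their send is started, so communication for slab i overlaps with computation on slab i+1. The row count, row length, and the fft_rows helper are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    #define NROWS  256    /* rows owned by this process (assume p divides NROWS) */
    #define ROWLEN 1024   /* doubles per row (illustrative)                      */

    /* Placeholder for the serial 1D FFTs over one slab of rows. */
    static void fft_rows(double *rows, int nrows, int rowlen) { (void)rows; (void)nrows; (void)rowlen; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int slab = NROWS / p;                       /* rows destined for each process */
        double *data    = calloc((size_t)NROWS * ROWLEN, sizeof *data);
        double *recvbuf = calloc((size_t)NROWS * ROWLEN, sizeof *recvbuf);
        MPI_Request *reqs = malloc(2 * p * sizeof *reqs);

        /* Post all receives up front so incoming slabs can arrive at any time. */
        for (int src = 0; src < p; src++)
            MPI_Irecv(recvbuf + (size_t)src * slab * ROWLEN, slab * ROWLEN, MPI_DOUBLE,
                      src, 0, MPI_COMM_WORLD, &reqs[src]);

        for (int dest = 0; dest < p; dest++) {
            double *rows = data + (size_t)dest * slab * ROWLEN;
            fft_rows(rows, slab, ROWLEN);           /* compute 1D FFTs for this slab */
            /* Start the send immediately; it proceeds while the next slab is computed. */
            MPI_Isend(rows, slab * ROWLEN, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &reqs[p + dest]);
        }
        MPI_Waitall(2 * p, reqs, MPI_STATUSES_IGNORE);
        /* The second-dimension FFTs on the received (exchanged) data would follow here. */

        free(reqs); free(recvbuf); free(data);
        MPI_Finalize();
        return 0;
    }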
NAS FT: UPC Non-Blocking MFlops
  [Bar chart: MFlop rate per thread (0–1000) for the blocking and non-blocking UPC FT versions on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512.]
° The Berkeley UPC compiler supports the non-blocking UPC extensions
° They produce a 15–45% speedup over the best UPC blocking version
° The non-blocking version requires about 30 extra lines of UPC code

NAS FT Variants Performance Summary
° Shown are the largest classes/configurations possible on each test machine
° MPI is not particularly tuned for many small/medium-size messages in flight (long message-matching queue depths)

Pencil/Slab Optimizations: UPC vs. MPI
  [Bar chart: fraction (0–1) of unoverlapped MPI communication time that UPC effectively overlaps with computation, best MPI and best UPC for each system (class/nprocs), on Myrinet 64, InfiniBand 256, Elan3 256, and Elan3 512.]
° Same data, viewed in the context of what MPI is able to overlap
° "For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?"
° On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication
° On Elan3, UPC obtains more overlap than MPI as the problem scales up

Summary of Overlap in FFTs
° One-sided communication has performance advantages
  • Better match for most networking hardware
    - Most cluster networks have RDMA support
    - Machines with global address space support (X1, Altix) are shown elsewhere
  • Smaller messages may make better use of the network
    - Spread communication over a longer period of time
    - Postpone the bisection bandwidth pain
  • Smaller messages can also prevent cache thrashing during packing
    - Avoid packing overheads if the natural message size is reasonable

FFTW: the "Fastest Fourier Transform in the West"
° C library for real and complex FFTs of arbitrary size and dimensionality (plus parallel versions for threads and MPI)
° Computational kernels (80% of the code) are automatically generated
° Self-optimizes for your hardware (picks the best composition of steps) = portability + performance
° Free software: http://www.fftw.org/

FFTW Performance: power-of-two sizes, double precision
  [Benchmark plots on an 833 MHz Alpha EV6, a 2 GHz AMD Opteron, a 2 GHz PowerPC G5, and a 500 MHz UltraSPARC IIe.]

FFTW Performance: non-power-of-two sizes, double precision
  [Benchmark plots on an 833 MHz Alpha EV6 and a 2 GHz AMD Opteron.]
° Unusual: non-power-of-two sizes receive as much optimization as powers of two
° …because we let the code do the optimizing

FFTW Performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2)
  [Benchmark plots for powers of two and non-powers of two.]
° Exploiting CPU-specific SIMD instructions (rewriting the code) is easy
° …because we let the code write itself

Why is FFTW fast? Three unusual features:
° 1. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.
° 2. The resulting plan is executed with explicit recursion, which enhances locality.
° 3. The base cases of the recursion are codelets: highly optimized dense code automatically generated by a special-purpose "compiler".

FFTW is easy to use
    {
        complex x[n];
        plan p;
        p = plan_dft_1d(n, x, x, FORWARD, MEASURE);
        ...
        execute(p);   /* repeat as needed */
        ...
        destroy_plan(p);
    }
° Key fact: usually, many transforms of the same size are required.
° (The names above are abbreviated slide-style; a complete version using the actual FFTW 3 API follows this slide.)
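For reference, here is a complete, compilable version of the snippet above using the FFTW 3 C API, where the abbreviated names map onto the fftw_-prefixed ones (build with something like cc example.c -lfftw3 -lm). The transform size and input values are only an illustration.

    #include <complex.h>
    #include <fftw3.h>
    #include <stdio.h>

    int main(void) {
        int n = 16;
        fftw_complex *x = fftw_malloc(sizeof(fftw_complex) * n);

        /* The planner (FFTW_MEASURE) times candidate plans; reuse the plan for the
           many transforms of the same size that applications typically need. */
        fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);

        for (int i = 0; i < n; i++) x[i] = i;   /* fill the input AFTER planning: FFTW_MEASURE overwrites it */
        fftw_execute(p);                        /* repeat as needed */

        for (int i = 0; i < n; i++)
            printf("X[%d] = %g %+gi\n", i, creal(x[i]), cimag(x[i]));

        fftw_destroy_plan(p);
        fftw_free(x);
        return 0;
    }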
Why is FFTW fast? (recap of the three unusual features: the planner, explicit recursion, generated codelets)

FFTW Uses Natural Recursion
  [Diagram: a size-8 DFT with radix p = 2 splits into two size-4 DFTs, each of which splits into two size-2 DFTs.]
° But the traditional implementation is non-recursive: a breadth-first traversal making log2(n) passes over the whole array

Traditional Cache Solution: Blocking
  [Same size-8 DFT diagram, traversed breadth-first but in blocks sized to fit the cache.]
° …this requires a program specialized for the cache size

Recursive Divide & Conquer is Good [Singleton, 1967]
  [Same size-8 DFT diagram, traversed depth-first.]
° The subproblems eventually become small enough to fit in cache … no matter what size the cache is

Cache Obliviousness
° A cache-oblivious algorithm does not know the cache size — it can be optimal for any machine and for all levels of cache simultaneously
° Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999] — all via the recursive divide-and-conquer approach

Why is FFTW fast? (recap: the planner picks the best composition, the plan executes with explicit recursion for locality, and the base cases are automatically generated codelets)