The Future of LAPACK and ScaLAPACK

Jim Demmel
UC Berkeley
CScADS Autotuning Workshop
9 July 2007
Challenges
• Increasing parallelism
– From supercomputers to multicore
• Exponentially growing gaps between
– Floating point time << 1/Memory BW << Memory Latency
• Improving 59%/year vs 23%/year vs 5.5%/year
– Floating point time << 1/Network BW << Network Latency
• Improving 59%/year vs 26%/year vs 15%/year
• Heterogeneity (performance and semantics)
• Asynchrony
• Unreliability
Goals of next Sca/LAPACK
1. Better algorithms
   – Faster, more accurate
2. Automate performance tuning
3. Better software engineering
4. Improve ease of use
5. Expand contents
   – More functions, more parallel implementations
6. Increased community involvement
What could go into Sca/LAPACK?
For all linear algebra problems
For all matrix structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy
(including condition estimates, etc)
Need to prioritize, automate!
How do we best explore this large tuning space?
• Algorithm tuning space includes
– Numerous block sizes, not just in underlying BLAS
– Many possible layers of parallelism, many mappings to HW
– Different traversals of underlying DAGs
• Left and right looking two of many; asynchronous algorithms
– “Redundant” algorithms
– Recursive layouts and algorithms
– New “optimal” algorithms for variations on standard factorizations
– New and old eigenvalue algorithms
– Redundancy, mixed precision, …
• Is there a concise set of abstractions to describe, generate tuning space?
– Block matrices, (partial, tree) factorizations, DAGs, …
– FLAME, CSS, Spiral, Sequoia, Telescoping languages, Bernoulli, Rose, …
• Challenge: What fraction of dense linear algebra can be done?
– Lots more than when we started
• Parallel BLAS -> LU -> other factorizations -> …
– Most of dense linear algebra?
• Not eigenvalue algorithms (on compact forms)
• What fraction of LAPACK can be done?
• Rest of loop “for all linear algebra problems…”
– For all interesting architectures…?
Exploiting GPUs
• Numerous emerging co-processors
– Cell, SSE, Grape, GPU, “physics coprocessor,” …
• When can we exploit them?
– Little help if memory is the bottleneck
– Various attempts to use GPUs for dense linear algebra
• Bisection on GPUs for symmetric tridiagonal eigenproblem
– Evaluate Count(x) = #(evals < x) for many x
– Very little memory traffic, but much redundant work
– Speedups up to 100x (Volkov)
• 43 Gflops on ATI Radeon X1900 vs running on 2.8 GHz Pentium 4
• Overall eigenvalue solver 6.8x faster
• Port of CLAPACK to NVIDIA underway
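The Count(x) kernel above can be sketched in a few lines; this is a minimal serial Python version of the standard Sturm-sequence recurrence on the pivots of T − xI (the GPU version evaluates it for many shifts x in parallel):

```python
import numpy as np

def sturm_count(d, e, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix with
    diagonal d and off-diagonal e that are less than x, counted as the
    number of negative pivots in the LDL^T factorization of T - x*I."""
    count = 0
    q = d[0] - x
    if q < 0:
        count += 1
    for i in range(1, len(d)):
        if q == 0.0:                 # safeguard against exact breakdown
            q = -np.finfo(float).tiny
        q = d[i] - x - e[i - 1] ** 2 / q
        if q < 0:
            count += 1
    return count

# tridiag(1, 2, 1) of order 3 has eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2)
d, e = np.array([2.0, 2.0, 2.0]), np.array([1.0, 1.0])
print([sturm_count(d, e, x) for x in (0.0, 1.5, 2.5, 5.0)])  # [0, 1, 2, 3]
```

The recurrence needs almost no memory traffic, one flop-heavy pass per shift, which is why it maps so well to a GPU.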
Iterative Refinement: For Accuracy
[Figure: error vs. condition number. Conventional Gaussian elimination: error ε = n^(1/2) · 2^(−24), growing toward 1/ε as the condition number increases. With extra-precise iterative refinement, the error stays near ε.]
Iterative Refinement: For Speed
• What if double precision is much slower than single?
– Cell processor in Playstation 3
• 256 GFlops single, 25 GFlops double
– Pentium SSE2: single twice as fast as double
• Given Ax=b in double precision
– Factor in single, do refinement in double
– If κ(A) < 1/ε_single, runs at speed of single
• 8x speedup on Cell, 1.9x on Intel-based laptop
• Applies to many algorithms, if difference large
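A minimal sketch of the scheme in NumPy, using an explicit single-precision inverse as a stand-in for the single-precision LU factorization (a real implementation would factor, not invert):

```python
import numpy as np

def refine(A, b, iters=10):
    """Mixed-precision iterative refinement sketch: the O(n^3) step
    (here an explicit inverse, standing in for a single-precision LU
    factorization) is done in float32; residuals are computed in float64."""
    M = np.linalg.inv(A.astype(np.float32))            # cheap low-precision "factorization"
    x = (M @ b.astype(np.float32)).astype(np.float64)  # initial solve in single
    for _ in range(iters):
        r = b - A @ x                                  # residual in double precision
        x += (M @ r.astype(np.float32)).astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
x_true = rng.standard_normal(200)
b = A @ x_true
x = refine(A, b)  # converges when kappa(A) << 1/eps_single
```

Each iteration only costs an O(n²) residual plus an O(n²) correction solve, so the O(n³) work all runs at single-precision speed while the final answer reaches double-precision accuracy for well-conditioned A.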
New algorithm for roots(p)
• To find roots of polynomial p
  – Roots(p) calls eig(C(p)), where C(p) is the companion matrix

             -p1 -p2 …  -pd
              1   0  …   0
      C(p) =  0   1  …   0
              …   …  …   …
              0   …  1   0

  – Costs O(n3), stable, reliable
• O(n2) alternatives
  – Newton, Jenkins-Traub, Laguerre, …
  – Stable? Reliable?
• New: Exploit “semiseparable” structure of C(p)
  – Low rank of any submatrix of upper triangle of C(p) preserved under QR iteration
  – Complexity drops from O(n3) to O(n2), stable in practice
• Related work: Gemignani, Bini, Pan, et al.
• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
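The Roots(p) = eig(C(p)) approach can be sketched with NumPy's dense eigensolver; this shows the baseline O(n³) method (the new O(n²) semiseparable QR iteration is not shown):

```python
import numpy as np

def companion(p):
    """Companion matrix C(p) of the monic polynomial
    x^d + p[0] x^(d-1) + ... + p[d-1]: first row -p1 ... -pd,
    shifted identity below the diagonal."""
    d = len(p)
    C = np.zeros((d, d))
    C[0, :] = -np.asarray(p, dtype=float)
    C[1:, :-1] = np.eye(d - 1)
    return C

def poly_roots(p):
    """Roots(p) = eig(C(p)); dense QR iteration, O(n^3)."""
    return np.linalg.eigvals(companion(p))

# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6, so p = [-6, 11, -6]
r = np.sort(poly_roots([-6.0, 11.0, -6.0]).real)
print(r)  # approximately [1. 2. 3.]
```

The semiseparable observation is that every submatrix of the strictly upper triangle of C(p) has low rank, and QR iteration preserves that structure, so each iteration can be done in O(n) instead of O(n²) work.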
New optimal algorithms (1)
• Long known (Strassen) how to invert
matrices as fast as matmul, but unstably:
    [ T11 T12 ]^(-1)   [ T11^(-1)  −T11^(-1) T12 T22^(-1) ]
    [  0  T22 ]      = [    0              T22^(-1)       ]
• New results
– Can make solving Ax=b, least squares,
eigenvalue problems as fast as fastest
matmul, and stable
• “Optimal” arithmetic complexity, but fast?
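The block formula above is easy to check numerically; a sketch that recurses on the two diagonal blocks, so the cost is dominated by the matrix multiplies (and hence runs at matmul speed if the multiplies do):

```python
import numpy as np

def block_tri_inv(T, cutoff=2):
    """Invert an upper-triangular matrix by the divide-and-conquer formula:
        [T11 T12]^-1 = [T11^-1   -T11^-1 T12 T22^-1]
        [ 0  T22]      [  0             T22^-1     ]
    The two half-size inverses recurse; the off-diagonal block costs
    two matrix multiplies."""
    n = T.shape[0]
    if n <= cutoff:
        return np.linalg.inv(T)
    k = n // 2
    T11i = block_tri_inv(T[:k, :k], cutoff)
    T22i = block_tri_inv(T[k:, k:], cutoff)
    Ti = np.zeros_like(T)
    Ti[:k, :k] = T11i
    Ti[:k, k:] = -T11i @ T[:k, k:] @ T22i
    Ti[k:, k:] = T22i
    return Ti

T = np.triu(np.ones((8, 8))) + 4 * np.eye(8)   # a well-conditioned test matrix
print(np.allclose(block_tri_inv(T) @ T, np.eye(8)))  # True
```

This illustrates why inversion is "as fast as matmul"; the instability the slide mentions shows up when the recursion is applied to general (non-triangular) matrices via fast matmul.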
New optimal algorithms (2)
• Communication costs of current ScaLAPACK
– LU & QR: O(n log p) messages
– Cholesky: O(n/b log p) messages
• New “LU” and “QR” algorithms
– As few messages as Cholesky
– “QR” returns QR but represented differently
– “LU” “equivalent” to LU in complexity, stability TBD
• “Optimal” communication complexity, but fast?
Goal 2 – Automate Performance Tuning
• 1300 calls to ILAENV() to get block sizes, etc.
– Never been systematically tuned
• Extend automatic tuning techniques of ATLAS, etc. to
these other parameters
– Automation important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
– Important for ease-of-use too
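An ATLAS-style empirical search over one such parameter can be sketched as a simple timing sweep; `blocked_matmul` and the candidate sizes here are illustrative stand-ins for the routines and ILAENV parameters actually being tuned:

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    """Blocked matrix multiply; nb plays the role of an ILAENV-style block size."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_block_size(n=256, candidates=(16, 32, 64)):
    """Empirical search: time each candidate block size on a
    representative problem and keep the fastest."""
    A = np.ones((n, n))
    B = np.ones((n, n))
    best_nb, best_t = candidates[0], float("inf")
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        t = time.perf_counter() - t0
        if t < best_t:
            best_nb, best_t = nb, t
    return best_nb
```

The point of automating this is exactly that the winning value changes from machine to machine, so a hand-picked ILAENV default goes stale as architectures evolve.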
ScaLAPACK Data Layouts
• 1D Block, 1D Block Cyclic, 1D Cyclic, 2D Block Cyclic

[Figure: execution time (seconds) of PDGESV for various processor grid shapes (1x60, 2x30, 3x20, 4x15, 5x12, 6x10) and problem sizes (1000–10000).]

• Speedups for using 2D processor grid range from 2x to 8x
• Cost of redistributing from 1D to best 2D layout: 1%–10%
• Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
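The layouts above differ only in the block size and process-grid shape of one underlying mapping; a minimal sketch of the 2D block-cyclic owner computation (here `nb` is the block size and `Pr x Pc` the process grid, with the 1D layouts as the special cases Pr = 1 or Pc = 1):

```python
def block_cyclic_owner(i, j, nb, Pr, Pc):
    """Process coordinates owning entry (i, j) of a matrix stored in a
    2D block-cyclic layout with square block size nb on a Pr x Pc grid:
    blocks are dealt out cyclically along each dimension."""
    return (i // nb) % Pr, (j // nb) % Pc

# entry (5, 7) with nb=2 lies in block (2, 3); on a 2x3 grid that wraps to (0, 0)
print(block_cyclic_owner(5, 7, 2, 2, 3))  # (0, 0)
```

Because the mapping is this simple arithmetic, converting between layouts on the fly (as the previous slide proposes) is mostly a communication problem, not an indexing one.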
Intel’s Clovertown Quad Core
• 3 implementations of LU factorization on a quad-core, 2-sockets-per-board system (8 threads):
  1. LAPACK (BLAS fork-join parallelism)
  2. ScaLAPACK (message passing using memory copy)
  3. DAG-based (dynamic scheduling)

[Figure: 8-core experiments — Mflop/s (0–45,000) vs. problem size (1,000–15,000) for the three implementations. Source: Jack Dongarra]
Motivation
• LAPACK and ScaLAPACK are widely used
– Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks,
NAG, NEC, SGI, …
– >72M web hits @ Netlib (incl. CLAPACK, LAPACK95)
• 35K hits/day
• Many ways to improve them, based on
– Own algorithmic research
– Enthusiastic participation of research community
– Opportunities and demands of new architectures
– Autotuning
• New releases planned (NSF support)
– LAPACK 3.1.1 released Feb 2007
Participants
• UC Berkeley:
– Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, undergrads…
• U Tennessee, Knoxville
– Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek,
Stan Tomov, Alfredo Buttari, Jakub Kurzak
• Other Academic Institutions
– UT Austin, U Colorado, UC Davis, Florida IT, U Kansas, U Maryland,
North Carolina SU, San Jose SU, UC Santa Barbara
– TU Berlin, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid,
U Manchester, U Umeå, U Wuppertal, U Zagreb
• Research Institutions
– CERFACS, LBL, INRIA
• Industrial Partners
– Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, SGI