Transcript Document 7371714
Libraries and Program Performance
NERSC User Services Group
An Embarrassment of Riches: Serial
Threaded Libraries (Threaded)
Parallel Libraries (Distributed & Threaded)
Elementary Math Functions
• Three libraries provide elementary math functions:
  – C/Fortran intrinsics
  – MASS/MASSV (Mathematical Acceleration Subsystem)
  – ESSL/PESSL (Engineering and Scientific Subroutine Library)
• Language intrinsics are the most convenient, but not the best performers
Elementary Functions in Libraries
• MASS:
  sqrt  atan2  rsqrt  sinh  exp  log  sin  cosh  tanh  cos  dnint  tan  x**y  atan
• MASSV:
  cos  dint  exp  log  sin  log  tan  div  rsqrt  sqrt  atan

See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
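To make the difference concrete, here is a minimal sketch (not from the slides) of replacing a scalar intrinsic loop with the corresponding MASSV vector routine; the array name, size, and data are illustrative, and the program must be linked against the MASSV library (loaded via the mass module shown later):

      program massv_sketch
c     Illustrative only: scalar sqrt intrinsic vs. the MASSV vector
c     routine vsqrt, which computes y(i) = sqrt(x(i)) for i = 1..n.
      implicit none
      integer n, i
      parameter (n = 1000000)
      real*8 x(n), y(n)
      do i = 1, n
         x(i) = dble(i) / dble(n)
      enddo
c     Scalar intrinsic: one call per element
      do i = 1, n
         y(i) = sqrt(x(i))
      enddo
c     MASSV: one call transforms the whole array
      call vsqrt(y, x, n)
      write(6,*) 'y(n) = ', y(n)
      end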
Other Intrinsics in Libraries
• ESSL:
  – Linear Algebra Subprograms
  – Matrix Operations
  – Linear Algebraic Equations
  – Eigensystem Analysis
  – Fourier Transforms, Convolutions, Correlations, and Related Computations
  – Sorting and Searching
  – Interpolation
  – Numerical Quadrature
  – Random Number Generation
Comparing Elementary Functions
• Loop schema for elementary functions:
   99 write(6,98)
   98 format( " sqrt: " )
      x = pi/4.0
      call f_hpmstart(1,"sqrt")
      do 100 i = 1, loopceil
         y = sqrt(x)
         x = y * y
  100 continue
      call f_hpmstop(1)
      write(6,101) x
  101 format( " x = ", g21.14 )
Comparing Elementary Functions
• Execution schema for elementary functions:
  setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
  module load hpmtoolkit
  module load mass
  module list
  setenv L1 "-Wl,-v,-bC:massmap"
  xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
  timex masstest < input > mathout
Results Examined
• Answers after 50e6 iterations
• User execution time
• # Floating and FMA instructions
• Operation rate in Mflip/sec
Results Observed
• No difference in answers
• Best times/rates at -O3 or -O4
• ESSL no different from intrinsics
• MASS much faster than intrinsics
Comparing Higher Level Functions
• Several sources of a matrix-multiply function:
  – User-coded scalar computation
  – Fortran intrinsic matmul
  – Single-processor ESSL dgemm
  – Multi-threaded SMP ESSL dgemm
  – Single-processor IMSL dmrrrr
  – Single-processor NAG f01ckf (32-bit)
  – Multi-threaded SMP NAG f01ckf
Sample Problem
• Multiply dense matrices:
  – A(1:n,1:n) = i + j
  – B(1:n,1:n) = j - i
  – C(1:n,1:n) = A * B
• Output C to verify the result
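As a point of reference, a minimal sketch of this sample problem using the Fortran intrinsic matmul; the matrix order n here is illustrative (the measurements below use n = 5,000 and 10,000):

      program matmul_sketch
c     Illustrative only: build A(i,j) = i + j and B(i,j) = j - i,
c     then form C = A * B with the Fortran intrinsic matmul.
      implicit none
      integer n, i, j
      parameter (n = 4)
      real*8 a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo
      c = matmul(a, b)
c     Output C to verify the result
      write(6,*) c
      end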
Kernel of user matrix multiply
      do i = 1, n
         do j = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
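For the ESSL (and ESSL-SMP) versions, the triple loop above collapses to a single call; a minimal sketch using the standard BLAS/ESSL dgemm calling sequence, assuming a, b, c, and n are declared and initialized as in the kernel above:

c     C := 1.0*A*B + 0.0*C via dgemm (same calling sequence in ESSL
c     and ESSL-SMP; the SMP version threads the call internally)
      call f_hpmstart(2,"dgemm multiply")
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      call f_hpmstop(2)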
Comparison of Matrix Multiply
(N1 = 5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar        1,490             168    106   (slowest)
Intrinsic          1,477             169    106   (slowest)
ESSL                 195           1,280   13.9
IMSL                 194           1,290   13.8
NAG                  195           1,280   13.9
ESSL-SMP              14          17,800    1.0   (fastest)
NAG-SMP               14          17,800    1.0   (fastest)
Observations on Matrix Multiply
• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance
• All the single processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor
• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries
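These fractions can be sanity-checked against hardware peak; a back-of-the-envelope sketch assuming 375 MHz POWER3-II processors (consistent with the -qarch=pwr3 compile flags above) with 4 floating-point operations per cycle and 16 processors per node, which are assumptions not stated on the slide:

\[
P_{\mathrm{cpu}} = 375\,\mathrm{MHz} \times 4\,\tfrac{\mathrm{flops}}{\mathrm{cycle}} = 1.5\ \mathrm{Gflip/s},
\qquad
P_{\mathrm{node}} = 16 \times 1.5\ \mathrm{Gflip/s} = 24\ \mathrm{Gflip/s}
\]
\[
\frac{1{,}280}{1{,}500} \approx 0.85\ \text{(single processor)},
\qquad
\frac{17{,}800}{24{,}000} \approx 0.74\ \text{(node)}
\]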
Comparison of Matrix Multiply
(N2 = 10,000)

Version     Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP         101            19,800       1.01
NAG-SMP          100            19,900       1.00

• Scaling with problem size (complexity increase ~8x):

Version     Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP           7.2                1.10
NAG-SMP            7.1                1.12

Both ESSL-SMP and NAG-SMP showed 10% performance gains with the larger problem size.
Observations on Scaling
• Scaling of problem size was only done for the SMP libraries, to fit into reasonable times.
• Doubling N results in an 8-fold increase in computational complexity for dense matrix multiplication
• Performance actually increased for both routines at the larger problem size
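A quick consistency check on these numbers, using the standard $2N^3$ flop count for dense matrix multiplication:

\[
\frac{\mathrm{flops}(N_2)}{\mathrm{flops}(N_1)} = \frac{2 N_2^3}{2 N_1^3} = \left(\frac{10{,}000}{5{,}000}\right)^{3} = 8,
\qquad
\frac{\mathrm{rate}(N_2)}{\mathrm{rate}(N_1)} \approx \frac{8}{7.2} \approx 1.1
\]

So an 8x larger problem finishing in only about 7.2x the wall clock time corresponds to the roughly 10% gain in delivered Mflip/s noted above.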
ESSL-SMP Performance vs. Number of Threads
• All for N = 10,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS
[Figure: ESSL-SMP Mflip/s (0 to 20,000) as a function of thread count (0 to 36), N = 10,000]
Parallelism: Choices Based on Problem Size
Doing a month's work in a few minutes!
• Three good choices:
  – ESSL / LAPACK
  – ESSL-SMP
  – ScaLAPACK
• Only beyond a certain problem size is there any opportunity for parallelism.
[Figure: Matrix-Matrix Multiply performance for the three choices vs. problem size]
Larger Functions: FFTs
• ESSL, FFTW, NAG, IMSL
  See http://www.nersc.gov/nusers/resources/software/libs/math/fft/
• We looked at ESSL, NAG, and IMSL
  – One-D, forward and reverse
One-D FFTs
• NAG
  – c06eaf: forward
  – c06ebf: inverse, conjugate needed
  – c06faf: forward, work-space needed
  – c06ebf: inverse, work-space & conjugate needed
• IMSL
  – z_fast_dft: forward & reverse, separate arrays
• ESSL
  – drcft: forward & reverse, work-space & initialization step needed
• All have size constraints on their data sets
One-D FFT Measurement
• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on same data
• Input and inverse outputs were identical
• Measured with HPMToolkit
NAG c06faf timing loop:

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()
One-D FFT Performance
Library   Routine      Direction   Wall Clock       Rate
NAG       c06eaf       fwd         25.182 sec.      54.006 Mflip/s
NAG       c06ebf       inv         24.465 sec.      40.666 Mflip/s
NAG       c06faf       fwd         29.451 sec.      46.531 Mflip/s
NAG       c06ebf       inv         24.469 sec.      40.663 Mflip/s
          (required a data copy for each iteration for each transform)
IMSL      z_fast_dft   fwd         71.479 sec.      46.027 Mflip/s
IMSL      z_fast_dft   inv         71.152 sec.      48.096 Mflip/s
ESSL      drcft        init         0.032 sec.      62.315 Mflip/s
ESSL      drcft        fwd          3.573 sec.     274.009 Mflip/s
ESSL      drcft        init         0.058 sec.      96.384 Mflip/s
ESSL      drcft        inv          3.616 sec.     277.650 Mflip/s
ESSL and ESSL-SMP: "Easy" Parallelism
(-qsmp=omp -qessl -lomp -lesslsmp)
• For simple problems you can dial in the local data size by adjusting the number of threads
• Cache reuse can lead to superlinear speedup
• An NH II node has 128 MB of cache!
Parallelism Beyond One Node: MPI
[Figure: a 2D problem decomposed over 4 MPI tasks and over 20 MPI tasks]
• Distributed data decomposition
• Distributed parallelism (MPI) requires both local and global addressing contexts
• Dimensionality of the decomposition can have a profound impact on scalability
• Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM)
• Decomposition is often a cause of load imbalance, which can reduce parallel efficiency
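To illustrate the local and global addressing contexts, here is a minimal sketch (not from the slides) of a 1D slab decomposition of a global x-dimension of size nx over MPI tasks; all variable names are illustrative:

      program slab_decomp
c     Illustrative only: split nx planes across MPI tasks.  Each rank
c     owns lnx planes starting at global index lxs (0-based), with any
c     remainder spread over the lowest-numbered ranks.
      implicit none
      include 'mpif.h'
      integer nx, ntasks, rank, ierr
      integer lnx, lxs, xloc, xglob
      parameter (nx = 100)
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      lnx = nx / ntasks
      lxs = rank * lnx + min(rank, mod(nx, ntasks))
      if (rank .lt. mod(nx, ntasks)) lnx = lnx + 1
c     Local context: loop over the planes this rank owns (local work);
c     xglob is the global index needed when exchanging data via MPI.
      do xloc = 1, lnx
         xglob = lxs + xloc
      enddo
      write(6,*) 'rank', rank, ': x from', lxs+1, 'to', lxs+lnx
      call MPI_Finalize(ierr)
      end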
Example FFTW: 3D
• Popular for its portability and performance
• Also consider PESSL's FFTs (not treated here)
• Uses a slab (1D) data decomposition
• Direct algorithms for transforms of dimensions of size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
• For parallel FFTW calls, transforms are done in place

Plan:
  fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags);
  fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);
Transform:
  fftwnd_mpi(plan, 1, data, work, transform_flags);

What are these? (see the data decomposition on the next slide)
FFTW: Data Decomposition
Each MPI rank owns a portion of the problem: a slab of lnx planes of the nx x ny x nz array, starting at global x index lxs.

  Local address context:   for (x = 0; x < lnx; x++) ...
  Global address context:  for (x = lxs; x < lxs + lnx; x++) ...
FFTW: Parallel Performance
• FFT performance may be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
• Consider data decompositions and paddings that lead to optimal local data sizes (cache use and prime factors)
FFTW: Wisdom
• Runtime performance optimization; can be stored to a file
• Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise options: FFTW_ESTIMATE
• Wisdom works better for serial FFTs; there is some benefit for parallel FFTs, but it must amortize the increase in planning overhead