
Libraries and Program Performance

NERSC User Services Group


An Embarrassment of Riches: Serial Libraries


Threaded Libraries (Threaded)


Parallel Libraries (Distributed & Threaded)


Elementary Math Functions

• Three libraries provide elementary math functions:
  – C/Fortran intrinsics
  – MASS/MASSV (Mathematical Acceleration Subsystem)
  – ESSL/PESSL (Engineering and Scientific Subroutine Library)
• Language intrinsics are the most convenient, but not the best performers


Elementary Functions in Libraries

• MASS – sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
• MASSV – sqrt, rsqrt, exp, log, sin, cos, tan, atan, dint, div (a usage sketch follows below)

See http://www.nersc.gov/nusers/resources/software/libs/math/MASS/
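As a usage sketch (not taken from the slides), the fragment below contrasts the Fortran intrinsic sqrt with the MASSV vector routine vsqrt. The array size, the test value, and the vsqrt(y, x, n) interface are assumptions based on the MASSV documentation linked above; verify them there before use.

      program mass_sketch
      implicit none
      integer, parameter :: n = 100000
      real*8  :: x(n), y(n)
      integer :: i

      x = 0.785398163397448d0        ! pi/4, the same test value as the loop schema

      ! Intrinsic version: one sqrt call per element
      do i = 1, n
         y(i) = sqrt(x(i))
      end do

      ! MASSV version: one call for the whole vector (link with -lmassv);
      ! vsqrt(y, x, n) is assumed here from the MASSV documentation
      call vsqrt(y, x, n)

      print *, y(1)*y(1)             ! should print ~0.785398... again
      end program mass_sketch

The scalar MASS library needs no source changes at all; relinking against it replaces the standard math entry points, which is why the compile line later in this section only adds $MASS.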


Other Intrinsics in Libraries

• ESSL
  – Linear Algebra Subprograms
  – Matrix Operations
  – Linear Algebraic Equations
  – Eigensystem Analysis
  – Fourier Transforms, Convolutions, Correlations, and Related Computations
  – Sorting and Searching
  – Interpolation
  – Numerical Quadrature
  – Random Number Generation


Comparing Elementary Functions

• Loop schema for elementary functions:

   99 write(6,98)
   98 format( " sqrt: " )
      x = pi/4.0

      call f_hpmstart(1,"sqrt")
      do 100 i = 1, loopceil
         y = sqrt(x)
  100 continue
      x = y * y
      call f_hpmstop(1)
      write(6,101) x
  101 format( " x = ", g21.14 )


Comparing Elementary Functions

• Execution schema for elementary functions:

setenv F1 "-qfixed=80 -qarch=pwr3 -qtune=pwr3 -O3 -qipa"
module load hpmtoolkit
module load mass
module list
setenv L1 "-Wl,-v,-bC:massmap"
xlf90_r masstest.F $F1 $L1 $MASS $HPMTOOLKIT -o masstest
timex masstest < input > mathout


Results Examined

• Answers after 50e6 iterations
• User execution time
• Number of floating-point and FMA instructions
• Operation rate in Mflip/s


Results Observed

• No difference in answers
• Best times/rates at -O3 or -O4
• ESSL no different from intrinsics
• MASS much faster than intrinsics


Comparing Higher Level Functions

• Several sources of a matrix-multiply function:
  – User-coded scalar computation
  – Fortran intrinsic matmul
  – Single-processor ESSL dgemm (see the sketch after this list)
  – Multi-threaded SMP ESSL dgemm
  – Single-processor IMSL dmrrrr
  – Single-processor NAG f01ckf (32-bit)
  – Multi-threaded SMP NAG f01ckf
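For reference, here is a minimal sketch of the ESSL/BLAS dgemm call for C = A*B mentioned in the list above. The program layout, matrix size, and alpha/beta values are illustrative assumptions, not the benchmark source.

      program dgemm_sketch
      implicit none
      integer, parameter :: n = 1000
      real*8, allocatable :: a(:,:), b(:,:), c(:,:)
      integer :: i, j

      allocate(a(n,n), b(n,n), c(n,n))
      do j = 1, n
         do i = 1, n
            a(i,j) = real(i+j)        ! same test matrices as the sample problem
            b(i,j) = real(j-i)
         end do
      end do
      c = 0.0d0

      ! dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      end program dgemm_sketch

The same call serves both serial ESSL and ESSL-SMP; only the link line changes (for example -lessl vs. -lesslsmp, as on the "easy parallelism" slide later in this section), which makes the SMP comparison below a drop-in test.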


Sample Problem

• Multiply dense matrices:
  – A(1:n,1:n): a(i,j) = i + j
  – B(1:n,1:n): b(i,j) = j - i
  – C(1:n,1:n) = A * B
• Output C to verify the result (a worked n = 2 instance follows)
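For intuition, a worked instance with n = 2 (not in the original slides):

    A = | 2  3 |     B = |  0  1 |     C = A * B = | -3  2 |
        | 3  4 |         | -1  0 |                 | -4  3 |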


Kernel of user matrix multiply

      do i = 1, n
         do j = 1, n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo

      call f_hpmstart(1,"Matrix multiply")
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)


Comparison of Matrix Multiply

(N1 = 5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = fastest)
User Scalar   1,490              168       106   (slowest)
Intrinsic     1,477              169       106   (slowest)
ESSL          195                1,280     13.9
IMSL          194                1,290     13.8
NAG           195                1,280     13.9
ESSL-SMP      14                 17,800    1.0   (fastest)
NAG-SMP       14                 17,800    1.0   (fastest)


Observations on Matrix Multiply

• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both reached 74% of peak node performance
• All the single-processor library functions took 14 times more wall-clock time than the SMP versions, each reaching about 85% of peak for a single processor
• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall-clock time than the SMP libraries


Comparison of Matrix Multiply

(N2 = 10,000)

Version    Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP   101                19,800    1.01
NAG-SMP    100                19,900    1.00

• Scaling with problem size (complexity increase ~8x):

Version    Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP   7.2                  1.10
NAG-SMP    7.1                  1.12

Both ESSL-SMP and NAG-SMP showed roughly 10% performance gains with the larger problem size.


Observations on Scaling

• Scaling of the problem size was done only for the SMP libraries, to keep run times reasonable
• Doubling N results in an 8-fold increase in computational complexity for dense matrix multiplication (see the worked numbers below)
• Performance actually increased for both routines at the larger problem size
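A quick check of that statement against the tables above:

    flops(N) ≈ 2N^3, so flops(N2)/flops(N1) = (10,000 / 5,000)^3 = 8
    observed wall-clock ratio ≈ 7.1–7.2, so rate ratio ≈ 8/7.2 ≈ 1.11

which matches the ~10% Mflip/s gain reported for the larger problem.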


ESSL-SMP Performance vs. Number of Threads

• All for N = 10,000
• Number of threads controlled by the environment variable OMP_NUM_THREADS

[Chart: performance (0–20,000 Mflip/s) vs. number of threads (0–36)]


Parallelism: Choices Based on Problem Size

Doing a month's work in a few minutes!

Three good choices:
• ESSL / LAPACK
• ESSL-SMP
• SCALAPACK

Only beyond a certain problem size is there any opportunity for parallelism.

[Chart: matrix-matrix multiply performance vs. problem size for the three choices]


Larger Functions: FFTs

• ESSL, FFTW, NAG, IMSL
  See http://www.nersc.gov/nusers/resources/software/libs/math/fft/
• We looked at ESSL, NAG, and IMSL
  – One-D, forward and reverse


One-D FFTs

• NAG (a round-trip sketch follows this list)
  – c06eaf – forward
  – c06ebf – inverse, conjugate needed
  – c06faf – forward, work-space needed
  – c06ebf – inverse, work-space & conjugate needed
• IMSL
  – z_fast_dft – forward & reverse, separate arrays
• ESSL
  – drcft – forward & reverse, work-space & initialization step needed
• All have size constraints on their data sets
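A minimal round-trip sketch for the NAG pair, assuming a real*8 input x(1:n), a working copy w(1:n), and a scratch array work(1:n) as in the measurement loop below. The conjugation is shown with c06gbf, which the slides do not name; treat the whole fragment as an assumption and check the NAG documentation for the exact recipe.

      ! Hedged sketch, not the benchmark code
      w(1:n) = x(1:n)                  ! c06faf transforms in place
      ifail = 0
      call c06faf(w, n, work, ifail)   ! forward real DFT, work-space needed
      call c06gbf(w, n, ifail)         ! conjugate the Hermitian result (assumed helper)
      call c06ebf(w, n, ifail)         ! inverse: w(1:n) should match x(1:n) again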


One-D FFT Measurement

• 2^24 real*8 data points input (a synthetic signal)
• Each transform ran in a 20-iteration loop
• All performed both forward and inverse transforms on the same data
• Input and inverse outputs were identical
• Measured with HPMToolkit

NAG measurement loop:

      second1 = rtc()
      call f_hpmstart(33,"nag2 forward")
      do loopind = 1, loopceil
         w(1:n) = x(1:n)
         call c06faf( w, n, work, ifail )
      end do
      call f_hpmstop(33)
      second2 = rtc()


One-D FFT Performance

NAG
  c06eaf       fwd    25.182 sec.    54.006 Mflip/s
  c06ebf       inv    24.465 sec.    40.666 Mflip/s
  c06faf       fwd    29.451 sec.    46.531 Mflip/s
  c06ebf       inv    24.469 sec.    40.663 Mflip/s
  (required a data copy for each iteration for each transform)

IMSL
  z_fast_dft   fwd    71.479 sec.    46.027 Mflip/s
  z_fast_dft   inv    71.152 sec.    48.096 Mflip/s

ESSL
  drcft        init    0.032 sec.    62.315 Mflip/s
  drcft        fwd     3.573 sec.   274.009 Mflip/s
  drcft        init    0.058 sec.    96.384 Mflip/s
  drcft        inv     3.616 sec.   277.650 Mflip/s


ESSL and ESSL-SMP: "Easy" Parallelism ( -qsmp=omp -qessl -lomp -lesslsmp )

• For simple problems you can dial in the local data size by adjusting the number of threads
• Cache reuse can lead to superlinear speed-up (see the estimate below)
• An NH II node has 128 MB of cache!
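To put that cache figure in context (a back-of-the-envelope estimate, not from the slides): for the N = 10,000 runs above, each real*8 matrix occupies

    10,000 × 10,000 × 8 bytes = 800 MB per matrix (about 2.4 GB for A, B, and C)

so the data never fits in cache as a whole; the superlinear effects come from adding threads until each thread's share of the blocked computation stays resident in the node's 128 MB of aggregate cache.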


Parallelism Beyond One Node: MPI

[Diagram: a 2D problem decomposed over 4 tasks vs. 20 tasks]

• Distributed data decomposition
• Distributed parallelism (MPI) requires both local and global addressing contexts
• Dimensionality of the decomposition can have a profound impact on scalability
• Consider the surface-to-volume ratio: surface = communication (MPI), volume = local work (HPM) (a worked estimate follows below)
• Decomposition is often the cause of load imbalance, which can reduce parallel efficiency
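A rough, generic estimate of why the dimensionality of the decomposition matters, for an n × n grid split across P tasks (an illustration, not from the slides):

    1D (strip) decomposition:  local work ∝ n²/P,  boundary ≈ 2n       →  surface/volume ≈ 2P/n
    2D (block) decomposition:  local work ∝ n²/P,  boundary ≈ 4n/√P    →  surface/volume ≈ 4√P/n

For large P the 2D decomposition therefore communicates much less per unit of local work, which is the scalability point of the bullets above.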


Example FFTW: 3D

• Popular for its portability and performance
• Also consider PESSL's FFTs (not treated here)
• Uses a slab (1D) data decomposition
• Direct algorithms for transforms of dimensions of size: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64
• For parallel FFTW calls, transforms are done in place

Plan:
    fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_FORWARD, flags);
    fftwnd_mpi_local_sizes(plan, &lnx, &lxs, &lnyt, &lyst, &lsize);

Transform:
    fftwnd_mpi(plan, 1, data, work, transform_flags);

What are lnx, lxs, lnyt, lyst, and lsize? (See FFTW: Data Decomposition below.)


FFTW: Data Decomposition

Each MPI rank owns a slab of the problem: lnx of the nx planes, starting at global index lxs; the full ny × nz extent of each plane is local.

Local address context:  x runs from 0 to lnx-1
Global address context: the same planes correspond to x = lxs through lxs+lnx-1

[Diagram: nx × ny × nz volume with one rank's slab, from lxs to lxs+lnx, highlighted]


FFTW: Parallel Performance

• FFT performance can be a complex function of problem size: the prime factors of the dimensions and the concurrency determine performance
• Consider data decompositions and paddings that lead to optimal local data sizes: cache use and prime factors (see the example below)
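One hypothetical illustration of the padding point (sizes invented for the example, not benchmark data): a dimension of length 509 is prime and cannot use the fast small-factor code paths listed earlier (sizes 1-16, 32, 64), while padding that dimension to 512 = 2^9 keeps every factor on the fast path at the cost of well under 1% extra memory.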


FFTW: Wisdom

• Runtime performance optimization; can be stored to a file
• Wise options: FFTW_MEASURE | FFTW_USE_WISDOM
• Unwise options: FFTW_ESTIMATE
• Wisdom works better for serial FFTs; there is some benefit for parallel FFTs, but it must amortize the increase in planning overhead