Transcript MATlab
Outline
Speeding up Matlab Computations
Symmetric Multi-Processing with Matlab
Accelerating Matlab computations with GPUs
Running Matlab in distributed memory environments
Using the Parallel Computing Toolbox
Using the Matlab Distributed Compute Engine Server
Using pMatlab
Mixing Matlab and Fortran/C code
Compiling MEX code from C/Fortran
Turning Matlab routines into C code
Symmetric Multi-Processing
By default Matlab uses all cores on a given node for operations that can be threaded
Arrays and matrices • Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN • Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE) • Array op
Symmetric Multi-Processing
To be sure you only use the resources you request, you should either
request an entire node and all of the CPU’s...
salloc –t 60 -c 24
module load matlab
srun matlab
Or request a single cpu and specify that Matlab should only use a single
thread
salloc –t 60
module load matlab
srun matlab -singleCompThread
Using GPUs with Matlab
Matlab can use GPUs to do calculations, provided a GPU is available on
the node Matlab is running on.
salloc -t 60 --gres=gpu:1
module load matlab
module load cuda
matlab
We can query the connected GPUs from within Matlab using
gpuDeviceCount()
gpuDevice()
And obtain a list of GPU supported functions using
methods('gpuArray')
Using GPUs with Matlab
So there is a 2D FFT – but no Hilbert function...
H=hilb(1000);
H_=gpuArray(H);
Z_=fft2(H_);
Z=gather(Z_);
imagesc(log(abs(Z)));
Distribute data to GPU
FFT performed on GPU
Gather data from GPU onto CPU
We could do the log and abs functions on the GPU as well.
H=hilb(1000);
H_=gpuArray(H);
Z_=fft2(H_);
imagesc(gather(log(abs(Z_)));
Using GPUs with Matlab
For our example, doing the FFT on the GPU is 7x faster. (4x if you include
moving the data to the GPU and back)
>> H=hilb(5000);
>> tic; A=gather(gpuArray(H)); toc
Elapsed time is 0.161166 seconds.
>> tic; A=gather(fft2(gpuArray(H))); toc
Elapsed time is 0.348159 seconds.
>> tic; A=fft2(H); toc
Elapsed time is 1.210464 seconds.
Using GPUs with Matlab
Matlab has no built in hilb() function that can run on the GPU – but we
can write our own function(kernel) in cuda – and save it to hilbert.cu
__global__ void HilbertKernel( double * const out, size_t const numRows, size_t const numCols)
{
const int rowIdx = blockIdx.x * blockDim.x + threadIdx.x;
const int colIdx = blockIdx.y * blockDim.y + threadIdx.y;
if ( rowIdx >= numRows ) return;
if ( colIdx >= numCols ) return;
size_t linearIdx = rowIdx + colIdx*numRows;
out[linearIdx] = 1.0 / (double)(1+rowIdx+colIdx) ;
}
And compile it with nvcc to generate a Parallel Thread eXecution file
nvcc -ptx hilbert.cu
Using GPUs with Matlab
We have to initialize the kernel and specify the grid size before executing
the kernel.
H_=gpuArray.zeros(1000);
hilbert_kernel=parallel.gpu.CUDAKernel('hilbert.ptx','hilbert.cu');
hilbert_kernel.GridSize=size(H_);
H_=feval(hilbert_kernel, H_, 1000,1000);
Z_=fft2(H_);
imagesc(gather(log(abs(Z_))));
The default for matlab kernel’s is 1 thread per block, but we could create
fewer blocks that were each 10 x 10 threads.
hilbert_kernel.ThreadBlockSize=[10,10,1];
hilbert_kernel.GridSize=[100,100];
Parallel Computing Toolbox
As an alternative you can also use the Parallel Computing Toolbox. This
supports parallelism via MPI
https://info.circ.rochester.edu/BlueHive/Software/Mathematics/matlab.html
You should create a pool that is the same size as the number of processors
you requested in your job submission. Matlab also sells licenses for using a
Distributed Computing Server which allows for matlabpools that use more
than just the local node.
Parallel Computing Toolbox
•
You can achieve parallelism in several ways:
•
parfor loops – execute for loops in parallel
•
smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’)
•
pmode – interactive version of smpd
•
distributed arrays – very similar to gpuArrays.
Parallel Computing Toolbox
•
You can achieve parallelism in several ways:
•
parfor loops – execute for loops in parallel
•
smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’)
•
pmode – interactive version of smpd
•
distributed arrays – very similar to gpuArrays.
matlabpool(4)
parfor n=1:100
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off'); imagesc(log(abs(Z)));
print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
end
matlabpool close
Parallel Computing Toolbox
•
You can achieve parallelism in several ways:
•
parfor loops – execute for loops in parallel
•
smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’)
•
pmode – interactive version of smpd
•
distributed arrays – very similar to gpuArrays.
matlabpool(4)
spmd
for n=drange(1:100)
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off');
imagesc(log(abs(Z)));
end
end
matlabpool close
matlabpool(4)
spmd
for n=labindex:numlabs:100
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off');
imagesc(log(abs(Z)));
end
end
matlabpool close
Parallel Computing Toolbox
•
You can achieve parallelism in several ways:
•
parfor loops – execute for loops in parallel
•
smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’)
•
pmode – interactive version of smpd
•
distributed arrays – very similar to gpuArrays.
pmode start 4
n=labindex;
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off'); imagesc(log(abs(Z)));
print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
pmode lab2client H 3 H3
H3
pmode close
Parallel Computing Toolbox
•
You can achieve parallelism in several ways:
•
parfor loops – execute for loops in parallel
•
smpd – execute instructions in parallel (using ‘labindex’ or ‘drange’)
•
pmode – interactive version of smpd
•
distributed arrays – very similar to gpuArrays.
Example using distributed arrays
Example using gpuArray
H=hilb(1000);
H_=gpuArray(H);
Z_=fft2(H_);
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool(8)
H=hilb(1000);
H_=distributed(H);
Z_=fft(fft(H_,[],1),[],2);
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool close
Parallel Computing Toolbox
What about building hilbert matrix in parallel?
matlabpool(4)
spmd
Define partition
codist=codistributor1d(1,[250,250,250,250],[1000,1000]);
Get local indices in x-direction
[i_lo, i_hi]=codist.globalIndices(1);
Allocate space for local part
H_local=zeros(250,1000);
for i=i_lo:i_hi
Initialize local array with
for j=1:1000
Hilbert values.
H_local(i-i_lo+1,j)=1/(i+j-1);
end
Assemble codistributed array
end
Now it's distributed like before!
H_ = codistributed.build(H_local, codist);
end
Z_=fft(fft(H_,[],1),[],2);
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool close
Using pMatlab
pMatlab is an alternative method to get distributed matlab functionality
without relying on Matlab’s Distributed Computing Server.
It is built on top of MapMPI (an MPI implementation for matlab –
written in matlab - that uses file I/O for communication)
It supports various operations on distributed arrays (up to 4D)
Remapping, aggregating, finding non-zero entries, transposing, ghosting
Elementary math functions (trig, exponential, complex, remainder/rounding)
2D Convolutions, FFTs, Discrete Cosine Transform
FFT's
need to be properly mapped (cannot be distributed along transform dimension).
It does not have as much functionality as the parallel computing
toolbox – but it does support ghosting and more flexible partitioning!
Using pMatlab
Since pMatlab works by launching other Matlab instances – we need
them to startup with pMatlab functionality. To do so we need to add a few
lines to our startup.m file in our matlab path.
addpath('/software/pMatlab/MatlabMPI/src');
addpath('/software/pMatlab/src');
rehash;
pMatlabGlobalsInit;
Running pMatlab in Batch Mode
To submit a job in batch mode we need to create a batch script
#SBATCH -N Matlab
#SBATCH -p standard
#SBATCH –t 60
#SBATCH –N 2
#SBATCH --ntasks-per-node=8
module load matlab
matlab -nodisplay -r "pmatlab_launcher"
sample_script.pbs
And a Matlab script to launch the pMatlab script
nProcs=getenv('SLURM_NTASKS);
[sreturn, machines]=system('nodelist');
machines=regexp(machines, '\n', 'split');
machines=machines(1:size(machines,2)-1);
eval(pRUN('pmatlab_script',nProcs,machines));
pmatlab_launcher.m
Running pMatlab in Batch Mode
And finally we have our pmatlab script.
Xmap=map([Np 1],{},0:Np-1);
H_=zeros(1000,1000,Xmap);
[I1,I2]=global_block_range(H_);
H_local=zeros(I1(2)-I1(1)+1,I2(2)-I2(1)+1);
for i=I1(1):I1(2)
for j=I2(1):I2(2)
H_local(i-I1(1)+1,j-I2(1)+1)=1/(i+j-1);
end
end
H_=put_local(H_,H_local);
Z_=fft(fft(H_,[],2),[],1);
Z=agg(Z_);
if (pMATLAB.my_rank == pMATLAB.leader)
f=figure('Visible','off');
imagesc(log(abs(Z)));
print('-dpdf','-r300', 'fig1.pdf');
end
map for distributing array
Distributed matrix constructor
Indices for local portion of array
Allocate and populate local
portion of array with Hilbert
matrix values
X = local
put_local(X,
fft(local(X),[],2));
Copy
values into
distributed array
Do
and do x-fft. Z_ has different map
Z y-fft
= transpose_grid(X);
Collect
resulting matrix
onto 'leader'
Z = put_local(Z,
fft(local(Z),[],1));
Plot result from 'leader'
matlab
pmatlab_script.m
process
Compiling Mex Code
C, C++, or Fortran routines can be called from within Matlab.
#include "fintrf.h"
subroutine mexfunction(nlhs, plhs, nrhs, prhs)
mwPointer :: plhs(*), prhs(*)
integer :: nlhs, nrhs
mwPointer :: mxGetPr
mwPointer :: mxCreateDoubleMatrix
real(8) :: mxGetScalar
mwPointer :: pr_out
integer :: n
n = nint(mxGetScalar(prhs(1)))
plhs(1) = mxCreateDoubleMatrix(n,n, 0)
pr_out = mxGetPr(plhs(1))
call compute(%VAL(pr_out),n)
end subroutine mexfunction
subroutine compute(h, n)
integer :: n
real(8) :: h(n,n)
integer :: i,j
do i=1,n
do j=1,n
h(i,j)=1d0/(i+j-1d0)
end do
end do
end subroutine compute
mex hilbert.F90
>> H=hilbert(10)
Turning Matlab code into C
First we create a log_abs_fft_hilb.m function
function result = log_abs_fft_hilb(n)
result=log(abs(fft2(hilb(n))));
And then we run
>> codegen log_abs_fft_hilb.m –args {uint32(0)}
This will produce a mex file that we can test.
>> A=log_abs_fft_hilb_mex(uint32(16));
>> B=log_abs_fft_hilb(16);
>> max(max(abs(A-B)))
ans = 8.8818e-16
We could have specified the type of 'n' in our matlab function
function result = log_abs_fft_hilb(n)
assert(isa(n,'uint32'));
result=log(abs(fft2(hilb(n))));
Turning Matlab code into C
Now we can also export a static library that we can link to:
>> codegen log_abs_fft_hilb.m -config coder.config('lib') -args {'uint32(0)'}
This will create a subdirectory codegen/lib/log_abs_fft_hilb that will
have the source files '.c and .h' as well as a compiled object files '.o' and a
library 'log_abs_fft_hilb.a'
The source files are portable to any platform with a 'C' compiler (ie
BlueStreak). We can rebuild the library on BlueStreak by running
mpixlc –c *.c
ar rcs log_abs_fft_hilb.a *.o
Turning Matlab code into C
To use the function, we still need to write a main subroutine that links to it. This requires
working with matlab's variable types (which include dynamically resizable arrays)
#include "stdio.h"
#include "rtwtypes.h"
#include "log_abs_fft_hilb_types.h"
Matlab type definitions
void main() {
uint32_T n=64;
Argument to Matlab function
emxArray_real_T *result;
Return value of Matlab function
int32_T i,j;
emxInit_real_T(&result, 2);
Initialize Matlab array to have rank 2
log_abs_fft_hilb(n, result);
Call matlab function
for(i=0;i<result->size[0];i++) {
for(j=0;j<result->size[1];j++) {
printf("%f ",result->data[i+result->size[0]*j]);
Output result in column
}
major order
printf("\n");
}
emxFree_real_T(&result);
Free up memory associated with return array
}
Turning Matlab code into C
And here is another example of calling 2D fft's on real data
void main() {
int32_T q0;
int32_T i;
int32_T n=8;
emxArray_creal_T *result;
emxArray_real_T *input;
emxInit_creal_T(&result, 2);
emxInit_real_T(&input, 2);
q0 = input->size[0] * input->size[1];
input->size[0]=n;
input->size[1]=n;
emxEnsureCapacity((emxArray__common *)input,
q0, (int32_T)sizeof(real_T));
for(j=0;j<input->size[1];j++ {
for(i=0;i<input->size[0];i++) {
input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1);
}
}
my_fft(input, result);
for(i=0;i<result->size[0];i++) {
for(j=0;j<result->size[1];j++) {
printf("[% 10.4f,% 10.4f] ",
result->data[i+result->size[0]*j].re,
result->data[i+result->size[0]*j].im);
}
printf("\n");
}
emxFree_creal_T(&result);
emxFree_real_T(&input);
}
Turning Matlab code into C
Exported FFT's only work on vectors of length 2N
Error checking is disabled in exported C code
Mex code will have the same functionality as exported C code, but will
also have error checking. It will warn about doing FFT's on arbitrary length
vectors, etc...
Always test your mex code!
Matlab code is not that different from C code
#include <stdio.h>
#include <math.h>
#include <complex.h>
#include <fftw3.h>
void main() {
int n=4096;
int i,j;
double complex temp[n][n], input[n][n];
double result[n][n];
fftw_plan p;
p=fftw_plan_dft_2d(n, n, &input[0][0], &temp[0][0],
FFTW_FORWARD, FFTW_ESTIMATE);
for (i=0;i<n;i++){
for(j=0;j<n;j++) {
input[i][j]=(double complex)(1.0/(double)(i+j+1));
}
}
fftw_execute(p);
for (i=0;i<n;i++){
for(j=0;j<n;j++) {
result[i][j]=log(cabs(temp[i][j]));
}
}
for (i=0;i<n;i++){
for(j=0;j<n;j++) {
printf("%f ",result[i][j]);
}
}
fftw_destroy_plan(p);
}
Or you can write your own 'C'
code that uses open source
mathematical libraries (ie fftw).