GPU acceleration in Matlab - the Department of Image Processing

GPU acceleration in Matlab
Jan Kamenický
UTIA Friday seminar
9.11.2012
GPU acceleration
• CPU
– fast
– general-purpose
• GPU
– highly parallel
– handles specific tasks with large amounts of data
– memory transfers needed
GPU acceleration in Matlab
• Built-in functions
– many Matlab functions support GPU acceleration natively
• arrayfun
– specific element-wise processing
• CUDA kernels
– write “.cu” files
– compile to “.ptx” (parallel thread execution)
– run using feval
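The arrayfun route listed above can be sketched as follows. This is a minimal illustration, not code from the seminar: the variable names and the element-wise formula are made up, and passing an anonymous function handle is an assumption about the installed release (older releases may require a handle to a function stored in its own file). The commented lines at the end only outline the CUDA-kernel route; the file names addScaled.ptx and addScaled.cu are hypothetical.

```matlab
% Element-wise processing with arrayfun on the GPU (illustrative sketch).
x = gpuArray(linspace(0, 1, 1000));   % copy the input vectors to GPU memory
y = gpuArray(linspace(1, 2, 1000));

% The function body may use only scalar, element-wise operations.
f = @(a, b) sqrt(a .* a + b .* b) + 1;
z = arrayfun(f, x, y);                % evaluated on the GPU

zHost = gather(z);                    % bring the result back to the workspace

% CUDA-kernel route (outline only; the file names are hypothetical):
%   k = parallel.gpu.CUDAKernel('addScaled.ptx', 'addScaled.cu');
%   k.ThreadBlockSize = 256;
%   k.GridSize = ceil(numel(x) / 256);
%   out = feval(k, x, y, numel(x));   % arguments must match the kernel signature
```

arrayfun is the lighter option when the per-element formula can be written in Matlab itself; a hand-written kernel is only needed when the computation does not fit that element-wise pattern.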
Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox
ver
Prerequisites
>> ver
-------------------------------------------------------------------------------------
MATLAB Version 7.13.0.564 (R2011b)
MATLAB License Number: XXXXXX
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
-------------------------------------------------------------------------------------
MATLAB                                    Version 7.13      (R2011b)
Simulink                                  Version 7.8       (R2011b)
Computer Vision System Toolbox            Version 4.1       (R2011b)
Curve Fitting Toolbox                     Version 3.2       (R2011b)
DSP System Toolbox                        Version 8.1       (R2011b)
Data Acquisition Toolbox                  Version 3.0       (R2011b)
Filter Design HDL Coder                   Version 2.9       (R2011b)
Fixed-Point Toolbox                       Version 3.4       (R2011b)
Global Optimization Toolbox               Version 3.2       (R2011b)
Image Acquisition Toolbox                 Version 4.2       (R2011b)
Image Processing Toolbox                  Version 7.3       (R2011b)
MATLAB Compiler                           Version 4.16      (R2011b)
MATLAB Distributed Computing Server       Version 5.2       (R2011b)
Neural Network Toolbox                    Version 7.0.2     (R2011b)
Optimization Toolbox                      Version 6.1       (R2011b)
Parallel Computing Toolbox                Version 5.2       (R2011b)
Partial Differential Equation Toolbox     Version 1.0.19    (R2011b)
Signal Processing Toolbox                 Version 6.16      (R2011b)
Simulink 3D Animation                     Version 6.0       (R2011b)
Statistics Toolbox                        Version 7.6       (R2011b)
Symbolic Math Toolbox                     Version 5.7       (R2011b)
Wavelet Toolbox                           Version 4.8       (R2011b)
Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox
ver
• NVIDIA GPU with CUDA compute capability 1.3 or higher
gpuDevice
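The three prerequisites above can also be checked from code instead of reading the ver and gpuDevice output by eye. The following is a sketch under stated assumptions: the variable names are illustrative, 'Distrib_Computing_Toolbox' is assumed to be the license feature name of the Parallel Computing Toolbox, and the 1.3 threshold is the one quoted on the slide (newer Matlab releases raise this minimum).

```matlab
% Check the prerequisites before attempting any GPU work (sketch).
hasRelease = ~verLessThan('matlab', '7.11');            % 7.11 is R2010b

% Toolbox must be licensed and its functions must be on the path.
hasToolbox = license('test', 'Distrib_Computing_Toolbox') ...
             && exist('gpuDeviceCount') ~= 0;

hasGpu = false;
if hasRelease && hasToolbox && gpuDeviceCount > 0
    dev = gpuDevice;                                    % currently selected device
    cc  = str2double(dev.ComputeCapability);            % e.g. '1.3' -> 1.3
    hasGpu = cc >= 1.3;                                 % threshold from the slide
    fprintf('%s, compute capability %s\n', dev.Name, dev.ComputeCapability);
end

if ~hasGpu
    disp('No usable GPU found; computations will stay on the CPU.');
end
```

A flag such as hasGpu lets the rest of a script fall back to plain CPU arrays on machines without a supported card.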
Prerequisites
>> gpuDevice
ans =
  parallel.gpu.CUDADevice handle
  Package: parallel.gpu
  Properties:
                      Name: 'GeForce GTX 285'
                     Index: 1
         ComputeCapability: '1.3'
            SupportsDouble: 1
             DriverVersion: 5
        MaxThreadsPerBlock: 512
          MaxShmemPerBlock: 16384
        MaxThreadBlockSize: [512 512 64]
               MaxGridSize: [65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+009
                FreeMemory: 1.9656e+009
       MultiprocessorCount: 30
              ClockRateKHz: 1476000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1
  Methods, Events, Superclasses
Basic usage
• Send data to GPU
– either allocate there or transfer from workspace
• Run Matlab functions
– GPU acceleration is used automatically
• Retrieve the output data
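The three steps above map onto a few lines of code. This is a minimal sketch with made-up data and variable names; fft2, abs and sum are used only because they appear in the list of GPU-enabled functions.

```matlab
% 1) Send data to the GPU
A = rand(512, 512);            % ordinary workspace array (CPU memory)
G = gpuArray(A);               % copy to GPU memory

% 2) Run Matlab functions; GPUArray inputs make them execute on the GPU
S = sum(abs(fft2(G)), 1);      % the result stays in GPU memory

% 3) Retrieve the output data
s = gather(S);                 % ordinary double array in the workspace again
```

Keeping intermediate results such as S on the GPU and calling gather only once at the end avoids repeated memory transfers.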
GPUArray class
parallel.gpu.GPUArray
– main data class for GPU computations
– stored in the GPU memory
– create directly using static methods
zeros, ones, eye, nan, inf, true, false, rand, randn, randi, linspace, logspace, colon
– copy from existing data
gpuArray(img)
GPUArray class
• Supported data types:
(u)int8, (u)int16, (u)int32, (u)int64, single, double, logical
– determine the type using
classUnderlying(gpuVar)
• Retrieve the data using
workspaceVar = gather(gpuVar)
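A short sketch of how the type queries behave, using a small made-up array. Only gpuArray, class, classUnderlying, single and gather are used; the variable names are illustrative.

```matlab
img8 = uint8(magic(4));           % small uint8 array in the workspace
g    = gpuArray(img8);            % copy to the GPU

outerClass = class(g);            % class of the container (the GPUArray type)
innerClass = classUnderlying(g);  % 'uint8', the type of the stored elements

gs   = single(g);                 % type conversion runs on the GPU
back = gather(gs);                % ordinary single array in the workspace
```

class reports the container, so classUnderlying is the call to use when code needs to branch on the element type of data that lives on the GPU.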
GPU accelerated Matlab functions (2012b)
methods('parallel.gpu.GPUArray')
abs, acos, acosh, acot, acoth, acsc, acsch, all, angle, any, arrayfun, asec,
asech, asin, asinh, atan, atan2, atanh, beta, betaln, bitand, bitcmp, bitget,
bitor, bitset, bitshift, bitxor, blkdiag, bsxfun, cast, cat, ceil, chol,
circshift, classUnderlying, colon, complex, cond, conj, conv, conv2, convn,
cos, cosh, cot, coth, cov, cross, csc, csch, ctranspose, cumprod, cumsum, det,
diag, diff, disp, display, dot, double, eig, eps, eq, erf, erfc, erfcinv,
erfcx, erfinv, exp, expm1, fft, fft2, fftn, fftshift, filter, filter2, find,
fix, fliplr, flipud, flipdim, floor, fprintf, full, gamma, gammaln, gather,
ge, gt, horzcat, hypot, ifft, ifft2, ifftn, ifftshift, imag, ind2sub, int16,
int2str, int32, int64, int8, inv, ipermute, iscolumn, isempty, isequal,
isequaln, isfinite, isinf, islogical, ismatrix, isnan, isreal, isrow,
issorted, issparse, isvector, kron, ldivide, le, length, log, log10, log1p,
log2, logical, lt, lu, mat2str, max, mean, meshgrid, min, minus, mldivide,
mod, mpower, mrdivide, mtimes, ndgrid, ndims, ne, nnz, norm, normest, not,
num2str, numel, perms, permute, plot (and related), plus, pow2, power, prod,
qr, rank, rdivide, real, reallog, realpow, realsqrt, rem, repmat, reshape,
rot90, round, sec, sech, shiftdim, sign, sin, single, sinh, size, sort,
sprintf, sqrt, squeeze, std, sub2ind, subsasgn, subsindex, subsref, sum, svd,
tan, tanh, times, trace, transpose, tril, triu, uint16, uint32, uint64, uint8,
uminus, uplus, var, vertcat
Simple example
• Solve system of linear equations (Ax = b)
A = gpuArray(A);
b = gpuArray(b);
x = A\b;
x = gather(x);
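A way to convince oneself that the GPU solve above returns the same answer as the CPU is to run both and compare. This is a sketch with a synthetic, well-conditioned test matrix; the size n and all variable names are assumptions made for illustration.

```matlab
n = 500;
A = rand(n) + n * eye(n);     % strictly diagonally dominant test matrix
b = rand(n, 1);

xCpu = A \ b;                               % reference solution on the CPU
xGpu = gather(gpuArray(A) \ gpuArray(b));   % same solve on the GPU

relDiff  = norm(xGpu - xCpu) / norm(xCpu);  % CPU/GPU agreement
residual = norm(A * xGpu - b) / norm(b);    % how well Ax = b is satisfied
fprintf('relative difference %.2e, residual %.2e\n', relDiff, residual);
```

Small differences between the two solutions are expected, since the CPU and GPU libraries may order floating-point operations differently.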
Simple example
• Compute convolution using FFT
img = gpuArray(img);
msk = padarray(msk,size(img)-size(msk),0,'post');
msk = gpuArray(msk);
I = fft2(img);
M = fft2(msk);
% alternative form also shown on the slide: M = fft2(msk,size(img,1),size(img,2));
res = real(ifft2(I.*M));
res = gather(res);
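Padding the mask only up to the image size, as in the slide code, yields a circular convolution: values that fall off one border wrap around to the opposite one. When an ordinary linear convolution is wanted, both inputs can be zero-padded to size(img)+size(msk)-1 first. The sketch below does this and compares the result with conv2; the test data, the variable names and the padding by indexing (used here so that padarray is not required) are assumptions made for illustration.

```matlab
img = rand(64, 48);                       % stand-in image
msk = ones(5, 5) / 25;                    % 5x5 averaging mask

outSize = size(img) + size(msk) - 1;      % size of the full linear convolution

% Zero-pad both inputs to outSize on the CPU.
imgP = zeros(outSize);  imgP(1:size(img, 1), 1:size(img, 2)) = img;
mskP = zeros(outSize);  mskP(1:size(msk, 1), 1:size(msk, 2)) = msk;

% Multiply the spectra on the GPU and transform back.
I   = fft2(gpuArray(imgP));
M   = fft2(gpuArray(mskP));
res = gather(real(ifft2(I .* M)));

ref    = conv2(img, msk);                 % 'full' convolution on the CPU
maxErr = max(abs(res(:) - ref(:)));       % should be at rounding-error level
```

The central size(img) part of res corresponds to conv2(img, msk, 'same') when the mask has odd dimensions.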
Linear system solution benchmark
[Figure: speedup of computations on GPU compared to CPU. x-axis: matrix size (number of equations); y-axis: speedup (0 to 3.5); curves: single-precision, double-precision]
Convolution benchmark
[Figure: speedup of computations on GPU compared to CPU. x-axis: matrix size; y-axis: speedup (0 to 5); curves: single-precision, double-precision]
Profiling
• Before optimizing (trying to use the GPU), locate promising parts of the code, such as
– custom code consuming the majority of time
– built-in functions that support GPUArray (consuming the majority of time)
– large input/output data, simple data types
• Test the speed afterwards
• GPU code cannot be profiled
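Since the profiler is of no help for the GPU part, the speed test can be done by hand with tic and toc. The sketch below times a matrix product on the CPU and on the GPU and measures the memory transfers separately. The problem size and variable names are illustrative, and wait(dev) is assumed to be available in the installed release; it blocks until the GPU has finished, which matters because GPU calls may return before the computation is complete.

```matlab
n = 2000;
A = rand(n, 'single');

tic;  B = A * A;  tCpu = toc;                        % CPU reference

dev = gpuDevice;
tic;  G = gpuArray(A);  wait(dev);  tUp = toc;       % host -> GPU transfer

H = G * G;  wait(dev);                               % warm-up run, not timed
tic;  H = G * G;  wait(dev);  tGpu = toc;            % GPU computation only

tic;  C = gather(H);  tDown = toc;                   % GPU -> host transfer

fprintf('CPU %.3f s | GPU %.3f s (+ %.3f s transfers)\n', ...
        tCpu, tGpu, tUp + tDown);
```

Reporting the transfer time next to the compute time shows whether a speedup survives once the data movement is included.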
Profiling