Speeding up MATLAB Applications


Parallel computing with MATLAB
© 2012 The MathWorks, Inc.
Going Beyond Serial MATLAB Applications
[Diagram: the MATLAB Desktop (Client) connected to a pool of Workers]
Programming Parallel Applications (CPU)
Ease of Use ↔ Greater Control

- Built-in support with toolboxes
Example: Optimizing Cell Tower Position
Built-in parallel support

- With Parallel Computing Toolbox, use the built-in parallel algorithms in Optimization Toolbox
- Run the optimization in parallel
- Use a pool of MATLAB workers
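A minimal sketch of what this built-in support looks like in code, assuming Parallel Computing Toolbox and Optimization Toolbox are installed; the objective function below is a stand-in, not the webinar's cell tower model:

% Hedged sketch: turning on built-in parallel support in Optimization Toolbox.
parpool;                                               % start a pool of MATLAB workers (matlabpool in older releases)
opts = optimoptions('fmincon', 'UseParallel', true);   % finite-difference gradients are evaluated in parallel
objective = @(x) (x(1) - 3)^2 + (x(2) + 1)^2;          % hypothetical stand-in objective
x0 = [0 0];
lb = [-10 -10];  ub = [10 10];
[xOpt, fval] = fmincon(objective, x0, [], [], [], [], lb, ub, [], opts);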
Tools Providing Parallel Computing Support

- Optimization Toolbox
- Global Optimization Toolbox
- Statistics Toolbox
- Signal Processing Toolbox
- Neural Network Toolbox
- Image Processing Toolbox
- …

These toolboxes (and blocksets) directly leverage functions in Parallel Computing Toolbox.

www.mathworks.com/builtin-parallel-support
Agenda

- Task parallel applications
- GPU acceleration
- Data parallel applications
- Using clusters and grids
Independent Tasks or Iterations

- Ideal problem for parallel computing
- No dependencies or communication between tasks
- Examples: parameter sweeps, Monte Carlo simulations

[Diagram: the same set of independent tasks finishes in less time when spread across workers]

blogs.mathworks.com/loren/2009/10/02/using-parfor-loops-getting-up-and-running/
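As a rough illustration of the Monte Carlo case, a hedged sketch in which each parfor iteration runs an independent batch of simulations (all names and numbers below are made up):

% Hedged sketch: independent Monte Carlo batches distributed with parfor.
nBatches  = 100;                              % hypothetical number of independent batches
nPerBatch = 1e4;                              % simulations per batch
batchMean = zeros(nBatches, 1);
parfor k = 1:nBatches
    r = 0.0005 + 0.01*randn(nPerBatch, 250);  % simulated daily returns of a hypothetical asset
    finalValue = prod(1 + r, 2);              % value after 250 trading days
    batchMean(k) = mean(finalValue);          % each batch is independent of the others
end
overallMean = mean(batchMean);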
Example: Parameter Sweep of ODEs
Parallel for-loops

- Parameter sweep of ODE system
  – Damped spring oscillator: m ẍ + b ẋ + k x = 0, with m = 5
  – Sweep through different values of damping b = 1, 2, … and stiffness k = 1, 2, …
  – Record peak value for each simulation
- Convert for to parfor
- Use a pool of MATLAB workers
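A hedged sketch of the converted loop, assuming ode45 and the spring model above; the sweep ranges, time span, and initial conditions are illustrative:

% Hedged sketch: parfor parameter sweep of the damped oscillator m*x'' + b*x' + k*x = 0.
m = 5;                                      % fixed mass (assumed)
bVals = 1:0.5:5;                            % damping values to sweep (illustrative)
kVals = 1:0.5:5;                            % stiffness values to sweep (illustrative)
[bGrid, kGrid] = meshgrid(bVals, kVals);
peakVals = zeros(size(bGrid));
parfor idx = 1:numel(bGrid)                 % each (b, k) pair is simulated independently
    b = bGrid(idx);
    k = kGrid(idx);
    odefun = @(t, y) [y(2); -(b*y(2) + k*y(1))/m];
    [~, y] = ode45(odefun, [0 25], [0 1]);  % initial displacement 0, initial velocity 1
    peakVals(idx) = max(y(:, 1));           % record the peak displacement
end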
The Mechanics of parfor Loops
[Diagram: the 10 loop iterations are divided among the workers in the pool of MATLAB workers; each worker runs a(i) = i for its share of the iterations]

a = zeros(10, 1);
parfor i = 1:10
    a(i) = i;
end
a
Agenda

- Task parallel applications
- GPU acceleration
- Data parallel applications
- Using clusters and grids
What is a Graphics Processing Unit (GPU)?

- Originally for graphics acceleration, now also used for scientific calculations
- Massively parallel array of integer and floating-point processors
  – Typically hundreds of processors per card
  – GPU cores complement CPU cores
- Dedicated high-speed memory

* Parallel Computing Toolbox requires NVIDIA GPUs with Compute Capability 1.3 or higher, including NVIDIA Tesla 20-series products. See a complete listing at www.nvidia.com/object/cuda_gpus.html
Performance Gain with More Hardware
[Diagram: using more cores (CPUs), several CPU cores sharing a cache, versus using GPUs, many GPU cores with dedicated device memory]
Example: Mandelbrot set

- The color of each pixel is the result of hundreds or thousands of iterations
- Each pixel is independent of the other pixels
- Hundreds of thousands of pixels
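A hedged sketch of this computation on the GPU using gpuArray; the grid size and iteration limit are arbitrary, and a supported NVIDIA GPU is assumed:

% Hedged sketch: Mandelbrot iteration counts computed element-wise on the GPU.
n = 1000;  maxIter = 500;                  % arbitrary resolution and iteration limit
x = gpuArray.linspace(-2, 1, n);
y = gpuArray.linspace(-1.5, 1.5, n);
[X, Y] = meshgrid(x, y);
c = X + 1i*Y;                              % one complex value per pixel
z = gpuArray.zeros(n, n);
count = gpuArray.zeros(n, n);
for k = 1:maxIter                          % every pixel is updated in parallel on the GPU
    z = z.^2 + c;
    count = count + (abs(z) <= 2);         % count iterations before divergence
end
count = gather(count);                     % bring the result back to the client
imagesc(count); axis image off;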
Real-world performance increase
Solving a wave equation
Grid Size      CPU (s)    GPU (s)    Speedup
64 x 64        0.1004     0.3553     0.28
128 x 128      0.1931     0.3368     0.57
256 x 256      0.5888     0.4217     1.4
512 x 512      2.8163     0.8243     3.4
1024 x 1024    13.4797    2.4979     5.4
2048 x 2048    74.9904    9.9567     7.5

Intel Xeon Processor X5650, NVIDIA Tesla C2050 GPU
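The wave-equation benchmark itself is not reproduced here, but a hedged sketch of the same kind of CPU-versus-GPU timing comparison, using element-wise math on a 2048 x 2048 grid; the numbers depend entirely on the hardware:

% Hedged sketch: rough CPU-vs-GPU timing of an element-wise computation.
N = 2048;
A = rand(N);
tic;  B = exp(sin(A)).^2;  tCPU = toc;               % CPU timing
G = gpuArray(A);
d = gpuDevice;
tic;  Bg = exp(sin(G)).^2;  wait(d);  tGPU = toc;    % wait for the GPU to finish before stopping the clock
fprintf('CPU %.3f s, GPU %.3f s, speedup %.1fx\n', tCPU, tGPU, tCPU/tGPU);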
Programming Parallel Applications (GPU)
Ease of Use ↔ Greater Control

- Built-in support with toolboxes
- Simple programming constructs: gpuArray, gather
- Advanced programming constructs: arrayfun, spmd
- Interface for experts: CUDAKernel, MEX support

www.mathworks.com/help/distcomp/run-cuda-or-ptx-code-on-gpu
www.mathworks.com/help/distcomp/run-mex-functions-containing-cuda-code
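At the "simple constructs" end of this range, a hedged sketch of moving data to the GPU, computing with overloaded functions, and bringing the result back:

% Hedged sketch: gpuArray and gather, the simplest level of GPU programming.
A = rand(4096);                    % ordinary array in host memory
G = gpuArray(A);                   % copy it to GPU device memory
F = fft2(G);                       % overloaded functions such as fft2 run on the GPU
P = real(F .* conj(F));            % element-wise operations stay on the GPU
powerSpectrum = gather(P);         % copy the result back to the MATLAB client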
Agenda

- Task parallel applications
- GPU acceleration
- Data parallel applications
- Using clusters and grids
Big Data: Distributed Arrays

[Diagram: a distributed array lives on the cluster, its elements spread across the workers, while it is manipulated remotely from the desktop]
Big Data: Distributed Arrays
y = distributed(rand(10));

[Diagram: the workers in the pool of MATLAB workers each hold a block of columns of y: columns 1:3, 4:6, 7:8, and 9:10]
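A hedged sketch of working with the distributed array from the client session, assuming a pool of workers is open; operations run where the data lives and only small results travel back:

% Hedged sketch: using a distributed array from the desktop.
y = distributed(rand(10));         % columns of y are spread across the pool
colMeans = mean(y);                % computed on the workers (result is still distributed)
total = gather(sum(y(:)));         % only the scalar result is sent back to the client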
Demo: Approximation of π
∫₀¹ 4 / (1 + x²) dx = π
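A hedged sketch of one way to run this demo in parallel, assuming an open pool and a midpoint-rule approximation of the integral; each worker integrates its own sub-interval and the partial sums are added across the pool:

% Hedged sketch: approximating pi by integrating 4/(1+x^2) over [0,1] with spmd.
spmd
    a = (labindex - 1) / numlabs;              % left end of this worker's sub-interval
    b = labindex / numlabs;                    % right end of this worker's sub-interval
    n = 1e6;                                   % midpoints per worker (arbitrary)
    x = a + (b - a) * ((1:n) - 0.5) / n;       % midpoint-rule sample points
    partial = sum(4 ./ (1 + x.^2)) * (b - a) / n;
    piApprox = gplus(partial);                 % add the partial sums across all workers
end
approx = piApprox{1};                          % same value on every worker; take the first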
Programming Parallel Applications (CPU)
Ease of Use ↔ Greater Control

- Built-in support with toolboxes
- Simple programming constructs: parfor, batch, distributed
- Advanced programming constructs: createJob, labSend, spmd
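A hedged sketch of the batch construct listed above, assuming the default local cluster profile; the function and arguments are placeholders:

% Hedged sketch: offloading work to the cluster with batch.
j = batch(@rand, 1, {1000}, 'Pool', 2);   % run rand(1000) on a worker; 'Pool',2 reserves 2 more workers for any parfor inside the task
wait(j);                                  % block until the job finishes
out = fetchOutputs(j);                    % out{1} holds the 1000-by-1000 result
delete(j);                                % clean up the job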
Agenda

- Task parallel applications
- GPU acceleration
- Data parallel applications
- Using clusters and grids
Working on C3SE
Apply for a project with SNIC
© 2012 The MathWorks, Inc.