
Parallel Computing and You: A CME213 Review

June 24, 2013 Jason Su

Technologies for C/C++/Fortran

• Single machine, multi-core
  – P(OSIX) threads: bare-metal multi-threading
  – OpenMP: compiler directives that implement various constructs like parallel-for
• Single machine, GPU
  – CUDA/OpenCL: bare-metal GPU coding
  – Thrust: algorithms library for CUDA inspired by the C++ STL
    • Like MATLAB, where you use a set of fast core functions and data structures to implement your program
• Multi-machine
  – Message Passing Interface (MPI): a language-independent communication protocol to coordinate and program a cluster of machines

Challenges of Parallel Programming

• Correctness
  – Race conditions
  – Synchronization/deadlock
  – Floating point arithmetic is not associative or distributive
• Debugging
  – How do you fix problems that are not reproducible?
  – How do you assess the interaction of many threads running simultaneously?
• Performance
  – Management and algorithm overhead
  – Amdahl's Law

    S(n_threads) = 1 / ((1 − P) + P / n_threads)

where P is the fraction of the program that can be parallelized and n_threads is the number of threads.

Challenges: Race Conditions

As a simple example, let us assume that two threads each want to increment the value of a global integer variable by one.

What we expected:

Thread 1      Thread 2      Integer Value
                            0
Read                        ← 0
Increment                   0
Write                       → 1
              Read          ← 1
              Increment     1
              Write         → 2

What we got:

Thread 1      Thread 2      Integer Value
                            0
Read                        ← 0
              Read          ← 0
Increment                   0
              Increment     0
Write                       → 1
              Write         → 1
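
A minimal sketch of this lost-update race in code (illustrative, not from the slides), with an OpenMP atomic as one possible fix; compile with g++ -fopenmp:

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n_increments = 1000000;
        int racy = 0, safe = 0;

        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < n_increments; i++) {
            racy++;              // unprotected read-modify-write: updates can be lost
            #pragma omp atomic
            safe++;              // atomic read-modify-write: always counts correctly
        }

        // racy usually ends up less than n_increments; safe always equals it.
        printf("racy = %d, safe = %d (expected %d)\n", racy, safe, n_increments);
        return 0;
    }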

Challenges: Deadlock

A real world example would be an illogical statute passed by the Kansas legislature in the early 20th century, which stated: “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.”

Challenges: Deadlock

• Consider a program that manages bank accounts:

    BankAccount:
        string owner
        float balance
        withdraw(float amount)
        deposit(float amount)

    transfer(Account from, Account to, float amount):
        lock(from)
        lock(to)
        from.withdraw(amount)
        to.deposit(amount)
        release(to)
        release(from)

Challenges: Deadlock

As a simple example, let us assume that Bilbo and Frodo each want to transfer something to the other at the same time.

What we expected:

Thread 1: Bilbo          Thread 2: Frodo
Deposit: Ring            Deposit: $1000
Transfer: Ring →         ← Transfer: $1000
Result: $1000            Result: Ring

What we got:

Thread 1: Bilbo          Thread 2: Frodo
Deposit: Ring            Deposit: $1000
Transfer: Ring →         ← Transfer: $1000
(locks Bilbo's account,  (locks Frodo's account,
 waits for Frodo's)       waits for Bilbo's)

"No, you give me it first, then I'll give it to you!"

Each transfer holds its own account's lock while waiting forever for the other's: deadlock.
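
A minimal C++ sketch (illustrative names, not the slides' code) of one common fix: acquire both account locks together with std::lock, so neither transfer can hold one lock while waiting for the other:

    #include <mutex>
    #include <string>

    struct BankAccount {
        std::string owner;
        float balance = 0;
        std::mutex m;
        void withdraw(float amount) { balance -= amount; }
        void deposit(float amount)  { balance += amount; }
    };

    void transfer(BankAccount& from, BankAccount& to, float amount) {
        std::lock(from.m, to.m);  // deadlock-free acquisition of both locks
        std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
        std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
        from.withdraw(amount);
        to.deposit(amount);
    }   // both locks are released here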

Challenges: Floating Point Arithmetic

Say we had a floating point format with one decimal digit of mantissa and one base-10 exponent digit, written d E e (meaning d × 10^e):

What we expected:

    1E1 + (3E0 + 4E0) = 1E1 + 7E0 = 17 → rounds to 2E1

What we got:

    (1E1 + 3E0) + 4E0 = 1E1 + 4E0 = 1E1
    (each intermediate sum rounds to one digit: 13 → 1E1, then 14 → 1E1)

Challenges: Floating Point Arithmetic

• Floating point numbers have limited precision
  – Every operation implicitly rounds, which means that order matters
• Float arithmetic is not associative or distributive
• Do operations on values of similar magnitude first
  – Actually an issue in all code, but it becomes apparent when serial and parallel answers differ

"Many programmers steeped in sequential programming for so many years make the assumption that there is only one right answer for their algorithm. After all, their code has always delivered the same answer every time it was run… however, all the answers are equally correct."
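
A small sketch (illustrative values, not from the slides) showing the same effect with single-precision floats: adding many small numbers to a large one in different orders gives different sums.

    #include <cstdio>

    int main() {
        const float big = 1.0e8f, small = 1.0f;

        // Large value first: each small addend falls below float precision and is rounded away.
        float large_first = big;
        for (int i = 0; i < 1000; i++) large_first += small;

        // Small values first: they accumulate to 1000 before meeting the large value.
        float small_first = 0.0f;
        for (int i = 0; i < 1000; i++) small_first += small;
        small_first += big;

        printf("large first: %.1f\n", large_first);  // 100000000.0
        printf("small first: %.1f\n", small_first);  // 100001000.0
        return 0;
    }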

Challenges: Floating Point Arithmetic

• The answer changes significantly with the number of threads
  – Because the order of operations has changed
  – This indicates an unstable numerical algorithm
  – Parallel programming revealed a flaw in the algorithm that is otherwise not apparent in serial code

OpenMP

• Uses a fork-join programming paradigm

OpenMP

• Declare private and shared variables for a thread
• Atomic operations
  – Ensure that an operation is uninterrupted by other threads; important for shared variables
• Parallel for-loop: see MATLAB's parfor

    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < N; i++) { ... }

• Reduction: combines the per-thread values into a single result (ℝ^n_threads → ℝ); a complete example follows

    #pragma omp parallel for reduction(+:val)
    for (uint i = 0; i < v.size(); i++)
        val += v[i];
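
A complete, compilable version of the reduction fragment above (a sketch; the vector contents and size are illustrative). Compile with g++ -fopenmp:

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);   // 2^20 ones

        float val = 0.0f;
        // Each thread accumulates into a private copy of val; OpenMP combines
        // the copies with + when the loop ends.
        #pragma omp parallel for reduction(+:val)
        for (int i = 0; i < (int)v.size(); i++)
            val += v[i];

        printf("sum = %.1f\n", val);  // 1048576.0
        return 0;
    }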

Moore's Law

• Serial scaling performance has reached its peak.
• Processors are not getting faster, but wider.

GPUs

CPU vs GPU

• GPU devotes more transistors to data processing


CPU vs GPU

• The CPU minimizes the time to complete a given task: the latency of a thread
• The GPU maximizes the number of tasks completed in a fixed time: the throughput of all threads
• Which is better? It depends on the problem.

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"
– Seymour Cray

NVIDIA CUDA

Compute Unified Device Architecture (CUDA)
• A set of libraries and extensions for C/C++ or Fortran
• Also 3rd-party support in MATLAB, Python, Java, and Mathematica

The language is evolving with the hardware
• Newer GPU architectures use different CUDA versions
• Denoted by their Compute Capability

The hardware originates from the desktop market
• NVIDIA offers its own expensive products designed for scientific computing (Tesla and Quadro)
• But desktop options are affordable and effective (GeForce)

NVIDIA CUDA

Extremely cheap thread creation and switching
• Hides latency with parallelism

Direct access to L1 cache (shared memory, up to 48 KB)
• Fast, but it is up to you to manage it correctly

Single instruction, multiple thread (SIMT) execution model
• May or may not fit your problem

NVIDIA GPU Architecture: Processors

• The GPU contains many streaming multiprocessors (MP or SM)
  – These do the work
  – Each groups threads into "warps" and executes them in lock step
  – Different warps are swapped in and out constantly to hide latency: this is the "parallelism"

What is a warp?

• A warp is a group of 32 threads that are executed in lock step (SIMT)
• Threads in a warp demand obedience: each can either be idle or execute the same instruction as its siblings
  – Divergent code (if/else, loops, return, etc.) can wastefully leave some threads idle, as sketched below
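
A hedged CUDA sketch (illustrative kernels, not from the slides) of divergence within a warp, and one way to avoid it by branching on the warp index instead of the lane index:

    // Divergent: even and odd lanes of the same warp take different branches,
    // so the two branches run one after the other while half the warp idles.
    __global__ void divergent(float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            out[i] = 2.0f * i;
        else
            out[i] = 0.5f * i;
    }

    // Coherent: every thread in a given warp takes the same path, because the
    // branch depends only on the warp index, not on the individual lane.
    __global__ void coherent(float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = threadIdx.x / warpSize;
        if (warp % 2 == 0)
            out[i] = 2.0f * i;
        else
            out[i] = 0.5f * i;
    }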

NVIDIA GPU Architecture: Memory

NVIDIA GPU Architecture

• Imagine a pool of threads representing all the work there is to do
  – For example, a thread for each pixel in the image
  – These threads need to be broken up into blocks that fit neatly onto each MP; this division is chosen by you
• Several limits affect how well these blocks fill up an MP, called its occupancy (a worked example follows this list):
  – Total threads ≤ 2048
  – Total blocks ≤ 16
  – Total shared memory ≤ 48 KB
  – Total registers ≤ 64K
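
A worked occupancy example, with illustrative numbers not taken from the slides: suppose each block has 256 threads and each thread uses 40 registers.

    Thread limit:   2048 threads/MP ÷ 256 threads/block         = 8 blocks
    Block limit:    16 blocks/MP                                  (not binding)
    Register limit: 65,536 registers/MP ÷ (256 × 40) per block   = 6 blocks

The register limit wins: 6 blocks × 256 threads = 1536 of the 2048 possible resident threads, i.e. 75% occupancy.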

CUDA Program Structure

Host code
• void main()
• Transfers data to global memory on the GPU/device
• Determines how the data and threads are divided to fit onto MPs
• Calls kernel functions on the GPU (a minimal sketch follows):

    kernel_name <<< gridDim, blockDim >>> (arg1, arg2, …);
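
A minimal host-program sketch of this pattern (the kernel name scale, sizes, and values are illustrative assumptions, not from the slides):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main()
    {
        const int N = 1 << 20;
        float* h_x = new float[N];
        for (int i = 0; i < N; i++) h_x[i] = 1.0f;

        // Transfer data to global memory on the GPU/device.
        float* d_x;
        cudaMalloc(&d_x, N * sizeof(float));
        cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

        // Divide the work: 256 threads per block, enough blocks to cover N.
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, N);

        // Copy the result back and clean up.
        cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h_x[0] = %.1f\n", h_x[0]);  // 2.0
        cudaFree(d_x);
        delete[] h_x;
        return 0;
    }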

CUDA Program Structure

• Memory and processors on the GPU exist in big discrete chunks
• Designing for the GPU is not only developing a parallel algorithm,
• but also shaping it so that it optimally maps to GPU hardware structures

CUDA Program Structure

Global kernel functions
• Device code called by the host
• Access data based on threadIdx location
• Do work, hopefully in a non-divergent way so all threads are doing something

Device functions
• Device code called by the device
• Helper functions, usually used in kernels (an example of both follows)
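
A small illustrative pair (names assumed, not from the slides): a __device__ helper called from a __global__ kernel.

    // Device function: callable only from GPU code; a helper used by the kernel.
    __device__ float clamp01(float x)
    {
        return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
    }

    // Global kernel function: launched by the host, one thread per element.
    __global__ void normalize(float* img, float scale, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // locate my data by thread index
        if (i < n)
            img[i] = clamp01(img[i] * scale);           // same work in every thread: no divergence
    }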

Connected Component Labeling

Finally, we get to the problem.

Connected Component Labeling

• Say you had a segmented mask of lesions:
  – How would you count the total number of (connected) lesions in the image?
  – How would you track an individual lesion over time?
• Connected labels share a similar property or state
  – In this case color, where we've simplified to black and white

Stars labeled 1 through 5467

Initialize with Raster-Scan Numbering

Input Image → Initial Labels: each pixel is numbered by its raster-scan index, 1 through 64, for this 8×8 example.
• The goal is to label each connected region with a unique number; min(a, b) is used to combine labels.

That means labels 1, 7, 9, and 54 should remain. (A minimal initialization kernel is sketched below.)
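
An illustrative initialization kernel for this raster-scan numbering (names assumed, not the paper's code):

    // Give every pixel a unique starting label equal to its raster-scan index (1..width*height).
    __global__ void init_labels(int* labels, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            labels[y * width + x] = y * width + x + 1;
    }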

Kernel A – Label Propagation

• Iterate until nothing changes:
  1. Assign a pixel to each thread
  2. For each thread:
     a) Look at my neighboring pixels
     b) If a neighbor's label is smaller than mine, re-label myself to the lowest adjacent label (sketched below)

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.
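
A hedged CUDA sketch of this per-pixel propagation step (4-connected neighbors, illustrative names; the published kernel differs in its details):

    // Each thread owns one pixel; it lowers its label to the smallest label of
    // any 4-connected neighbor with the same image value. The host re-launches
    // the kernel until *changed stays 0.
    __global__ void label_propagate(const int* image, int* labels,
                                    int width, int height, int* changed)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx  = y * width + x;
        int me   = image[idx];
        int best = labels[idx];

        if (x > 0          && image[idx - 1]     == me) best = min(best, labels[idx - 1]);
        if (x < width - 1  && image[idx + 1]     == me) best = min(best, labels[idx + 1]);
        if (y > 0          && image[idx - width] == me) best = min(best, labels[idx - width]);
        if (y < height - 1 && image[idx + width] == me) best = min(best, labels[idx + width]);

        if (best < labels[idx]) {
            labels[idx] = best;   // re-label myself to the lowest adjacent label
            *changed = 1;         // signal the host that another iteration is needed
        }
    }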


Kernel A – Label Propagation

• A label can only propagate by a maximum of one cell per iteration
  – Many iterations are required
• Very slow for large clusters in the image
  – We're getting killed by many accesses to global memory, each taking O(100) cycles
  – Even having many parallel threads is not enough to hide it: we need 32 cores/MP × 400 cycles = 12,800 threads active per MP

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Kernel B – Local Label Propagation

• Take advantage of L1-speed shared memory
• Iterate until global convergence:
  1. Assign a pixel to each thread
  2. For each thread:
     a) Load my own and my neighbors' labels into shared memory
     b) Iterate until local convergence:
        i. Look at my neighboring pixels
        ii. If a neighbor's label is smaller than mine, re-label myself to the lowest adjacent label (a simplified sketch follows the citation)

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.
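
A simplified, hedged CUDA sketch of this shared-memory variant (illustrative names; it assumes the image dimensions are multiples of BLOCK so border checks can be omitted, and unlike the published kernel it only propagates within the tile, leaving cross-tile propagation to the outer host-driven loop):

    #define BLOCK 16

    __global__ void label_propagate_local(const int* image, int* labels,
                                          int width, int* changed)
    {
        __shared__ int tile[BLOCK][BLOCK];   // this block's labels in fast shared memory
        __shared__ int tile_changed;

        int tx = threadIdx.x, ty = threadIdx.y;
        int x = blockIdx.x * BLOCK + tx;
        int y = blockIdx.y * BLOCK + ty;
        int idx = y * width + x;

        tile[ty][tx] = labels[idx];
        int me = image[idx];
        __syncthreads();

        // Iterate in shared memory until this tile stops changing.
        do {
            if (tx == 0 && ty == 0) tile_changed = 0;
            __syncthreads();

            int best = tile[ty][tx];
            if (tx > 0         && image[idx - 1]     == me) best = min(best, tile[ty][tx - 1]);
            if (tx < BLOCK - 1 && image[idx + 1]     == me) best = min(best, tile[ty][tx + 1]);
            if (ty > 0         && image[idx - width] == me) best = min(best, tile[ty - 1][tx]);
            if (ty < BLOCK - 1 && image[idx + width] == me) best = min(best, tile[ty + 1][tx]);

            if (best < tile[ty][tx]) {
                tile[ty][tx] = best;
                tile_changed = 1;
            }
            __syncthreads();
        } while (tile_changed);

        // Write back and tell the host whether another global iteration is needed.
        if (tile[ty][tx] < labels[idx]) {
            labels[idx] = tile[ty][tx];
            *changed = 1;
        }
    }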


Kernel B – Local Label Propagation

• A label can only propagate by a maximum of one block per iteration
  – Iterations are reduced, but there can still be a lot of them
• Tiling the data into shared-memory blocks is a very common technique
  – It appears in many applications, including FDTD

Kernel D – Label Equivalence

• Completely change the algorithm
  – How do we make it so labels can propagate extensively in one iteration?
  – Track which labels are equivalent and resolve the equivalency chains

Iterate until convergence:

Scanning
• Find the minimum neighboring label for each pixel and record it in the equivalence list

Analysis
• Traverse the equivalency chain until you reach a label that is equivalent to itself (sketched below)

Relabeling
• Relabel the pixels with their resolved labels
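
A hedged sketch of the analysis step (illustrative names; the paper's kernel differs in its details): each thread follows the chain for one label until it reaches a label that points to itself.

    // equiv[l] holds the smallest label currently known to be equivalent to l.
    // Following the chain l -> equiv[l] -> ... ends at a root with equiv[r] == r.
    __global__ void resolve_equivalences(int* equiv, int n_labels)
    {
        int l = blockIdx.x * blockDim.x + threadIdx.x;
        if (l >= n_labels) return;

        int root = equiv[l];
        while (equiv[root] != root)   // walk the chain until it is self-referential
            root = equiv[root];
        equiv[l] = root;              // flatten: l now points directly at its root
    }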

Kernel D – Label Equivalence

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

CPU – MATLAB’s bwlabel

Run-length encode the image
• There are special hardware instructions that make this very fast

Scan the runs
• Assign initial labels and record label equivalences in an equivalence table

Analysis
• Here the chains are traversed to the actual minimum instead of to a self-reference
• This is possible because the code runs serially

Relabel the runs

Performance

Input        Solution    CPU: MATLAB's bwlabel   GPU: Kernel A   GPU: Kernel B   GPU: Kernel D
477x423      23 labels   1.23 ms                 92.9 ms         14.3 ms         1.05 ms
1463x2233    68 labels   15.2 ms                 5031.1 ms       491.3 ms        12.3 ms

Results: CPU vs GPU

The GPU isn't a cure-all
• A naïve algorithm will still lose to the CPU, badly
• Starting from the best serial algorithm is smart
• Shared memory can offer a 10x improvement

Connected component labeling

• Is actually a more challenging problem for the GPU to beat the CPU on
• It is a memory-access-bound problem
• It does not take advantage of the TFLOPs available

Figure Credits

• Christopher Cooper, Boston University
• Eric Darve, Stanford University
• Gordon Erlebacher, Florida State University
• CUDA C Programming Guide