CUDA Lecture 1 Introduction to Massively Parallel Computing



Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

 Conventional von Neumann architecture consists of a processor executing a program stored in a (main) memory. Each main memory location is identified by its address. Addresses start at zero and extend to 2^n − 1 when there are n bits (binary digits) in the address.

Parallel Architectures and Performance Analysis – Slide 2

 Parallel computer: multiple-processor system supporting parallel programming.

 Three principal types of architecture:
   Vector computers, in particular processor arrays
   Shared memory multiprocessors: specially designed and manufactured systems
   Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations
Parallel Architectures and Performance Analysis – Slide 3

 Vector computer: instruction set includes operations on vectors as well as scalars
 Two ways to implement vector computers:
   Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
   Processor array: many identical, synchronized arithmetic processing elements
Parallel Architectures and Performance Analysis – Slide 4

 Historically, high cost of a control unit  Scientific applications have data parallelism Parallel Architectures and Performance Analysis – Slide 5

 Front-end computer (standard uniprocessor): program and data manipulated sequentially
 Processor array (individual processor/memory pairs): data manipulated in parallel
 Performance depends on:
   Speed of processing elements
   Utilization of processing elements
   Size of data structure
Parallel Architectures and Performance Analysis – Slide 6

 Each VLSI chip has 16 processing elements Parallel Architectures and Performance Analysis – Slide 7

 Not all problems are data parallel
 Speed drops for conditionally executed code
 Do not adapt to multiple users well
 Do not scale down well to “starter” systems
 Rely on custom VLSI for processors
 Expense of control units has dropped
Parallel Architectures and Performance Analysis – Slide 8

 Natural way to extend single processor model  Have multiple processors connected to multiple memory modules such that each processor can access any memory module  So-called shared memory configuration: Parallel Architectures and Performance Analysis – Slide 9

Parallel Architectures and Performance Analysis – Slide 10

 Any memory location is accessible by any of the processors.

 A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.

 Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.).

Parallel Architectures and Performance Analysis – Slide 11

 Alternatively known as a tightly coupled architecture.

 No local memory associated with processors.

 Avoids three problems of processor arrays:
   Can be built from commodity CPUs
   Naturally supports multiple users
   Maintains efficiency in conditionally executed code
Parallel Architectures and Performance Analysis – Slide 12

 Several alternatives for programming shared memory multiprocessors:
   Using threads (pthreads, Java, …), in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each able to access variables declared outside the threads.
   Using a sequential programming language with user-level libraries to declare and access shared variables.

Parallel Architectures and Performance Analysis – Slide 13

 Several alternatives for programming shared memory multiprocessors:
   Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism.
   Ex: OpenMP – the industry standard
     An API for shared-memory systems
     Supports higher-performance parallel programming of symmetrical multiprocessors
Parallel Architectures and Performance Analysis – Slide 14
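A minimal OpenMP sketch in C, assuming a made-up array `data` of size `N` summed with a reduction (the names and sizes are illustration choices, not from the slides):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;                      /* fill with sample values */

    /* The compiler directive marks the loop as parallel; OpenMP divides the
       iterations among the available threads, and the reduction clause
       combines the per-thread partial sums into one shared result. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Compiled with an OpenMP-aware compiler (e.g. `gcc -fopenmp`), the same source runs serially or in parallel depending on the directives, which is the appeal of the directive-based approach described on the slide.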

 Several alternatives for programming shared memory multiprocessors:
   Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor.
   Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code.
 Neither of these is now common.

Parallel Architectures and Performance Analysis – Slide 15

 Type 1: Centralized multiprocessor
   Straightforward extension of the uniprocessor:
     Add CPUs to the bus
     All processors share the same primary memory
     Memory access time is the same for all CPUs
   An example of a uniform memory access (UMA) multiprocessor
   Also known as a symmetrical multiprocessor (SMP)
Parallel Architectures and Performance Analysis – Slide 16

Parallel Architectures and Performance Analysis – Slide 17

 Private data: items used only by a single processor
 Shared data: values used by multiple processors
 In a centralized multiprocessor, processors communicate via shared data values
 Problems associated with shared data:
   Cache coherence
     Replicating data across multiple caches reduces contention
     But how do we ensure different processors have the same value for the same address?
   Synchronization (see the pthreads sketch after this list)
     Mutual exclusion
     Barriers
Parallel Architectures and Performance Analysis – Slide 18
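A minimal pthreads sketch of these two mechanisms, assuming four threads and an invented shared counter: a mutex provides mutual exclusion and a barrier holds every thread until all have arrived.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static long shared_counter = 0;                     /* shared data value */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;

    /* Mutual exclusion: only one thread at a time updates the shared value. */
    pthread_mutex_lock(&lock);
    shared_counter += id;
    pthread_mutex_unlock(&lock);

    /* Barrier: no thread proceeds until every thread has reached this point. */
    pthread_barrier_wait(&barrier);

    if (id == 0)
        printf("counter after barrier = %ld\n", shared_counter);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&barrier);
    return 0;
}
```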

 Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing).

 Then shared memory programming techniques can be used.

Parallel Architectures and Performance Analysis – Slide 19

 Type 2: Distributed multiprocessor
   Distribute primary memory among processors:
     Increases aggregate memory bandwidth and lowers average memory access time
     Allows a greater number of processors
   Also called a non-uniform memory access (NUMA) multiprocessor
Parallel Architectures and Performance Analysis – Slide 20

Parallel Architectures and Performance Analysis – Slide 21

 Some NUMA multiprocessors do not support cache coherence in hardware:
   Only instructions and private data are cached
   Large memory access time variance
 Implementation is more difficult:
   No shared memory bus to “snoop”
   A directory-based protocol is needed
Parallel Architectures and Performance Analysis – Slide 22

 Distributed directory contains information about cacheable memory blocks
 One directory entry for each cache block
 Each entry has:
   Sharing status
     Uncached: block not in any processor’s cache
     Shared: cached by one or more processors; read only
     Exclusive: cached by exactly one processor, which has written the block, so the copy in memory is obsolete
   Which processors have copies
Parallel Architectures and Performance Analysis – Slide 23
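One possible way to picture a directory entry, sketched as a C struct; the field names and the 32-processor bit vector are assumptions for illustration, not a real protocol implementation.

```c
#include <stdint.h>

/* Sharing status of a cached memory block, as described on the slide. */
enum sharing_status {
    UNCACHED,   /* block not in any processor's cache                    */
    SHARED,     /* cached (read-only) by one or more processors          */
    EXCLUSIVE   /* cached by exactly one processor that has written it;
                   the copy in memory is obsolete                        */
};

/* One directory entry per cache block. */
struct directory_entry {
    enum sharing_status status;
    uint32_t sharers;   /* bit i set => processor i holds a copy
                           (assumes at most 32 processors)               */
};

/* Example query: does processor p hold a copy of this block? */
static inline int has_copy(const struct directory_entry *e, int p) {
    return (e->sharers >> p) & 1u;
}
```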

 Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 24

 Distributed memory multiple-CPU computer  Same address on different processors refers to different physical memory locations  Processors interact through message passing  Commercial multicomputers  Commodity clusters Parallel Architectures and Performance Analysis – Slide 25

 Alternate name for message-passing multicomputer systems.

 Each processor has its own memory accessible only to that processor.

 A message passing interconnection network provides point-to-point connections among processors.

 Memory access varies between processors.

Parallel Architectures and Performance Analysis – Slide 26
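Message-passing multicomputers of this kind are commonly programmed with MPI, which is not named on the slide; a minimal point-to-point exchange, assuming at least two processes are launched:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* data in rank 0's private, local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d via message passing\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Because each process has only its own memory, the value moves between processors solely through the send/receive pair over the interconnection network.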

Parallel Architectures and Performance Analysis – Slide 27

 Advantages:
   Back-end processors dedicated to parallel computations: easier to understand, model and tune performance
   Only a simple back-end operating system needed: easy for a vendor to create
 Disadvantages:
   Front-end computer is a single point of failure
   A single front-end computer limits the scalability of the system
   The primitive operating system in the back-end processors makes debugging difficult
   Every application requires development of both front-end and back-end programs
Parallel Architectures and Performance Analysis – Slide 28

Parallel Architectures and Performance Analysis – Slide 29

 Advantages:
   Alleviates the performance bottleneck caused by a single front-end computer
   Better support for debugging
   Every processor executes the same program
 Disadvantages:
   More difficult to maintain the illusion of a single “parallel computer”
   No simple way to balance the program development workload among processors
   More difficult to achieve high performance when multiple processes run on each processor
Parallel Architectures and Performance Analysis – Slide 30

Parallel Architectures and Performance Analysis – Slide 31

 Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.

 Also important are number of processors, number of programs which can be executed, and the memory structure.

Parallel Architectures and Performance Analysis – Slide 32

 Single instruction stream, single data stream (SISD) computer  In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items.

 The single CPU executes one instruction at a time and fetches or stores one item of data at a time.

Parallel Architectures and Performance Analysis – Slide 33

[Diagram: SISD organization — a control unit issues an instruction stream (control signals) to an arithmetic processor, which exchanges a single data stream and results with memory.]

Parallel Architectures and Performance Analysis – Slide 34

 Single instruction stream, multiple data stream (SIMD) computer  A specially designed computer in which a single instruction stream is generated from a single program, but multiple data streams exist.

  The instructions from the program are broadcast to more than one processor.

Each processor executes the same instruction in synchronism, but using different data.

 Developed because there are a number of important applications that mostly operate upon arrays of data.

Parallel Architectures and Performance Analysis – Slide 35

[Diagram: SIMD organization — one control unit broadcasts a control signal to processing elements PE 1 through PE n, each operating on its own data stream (Data Stream 1 through Data Stream n).]

Parallel Architectures and Performance Analysis – Slide 36

 Processing distributed over a large amount of hardware.

 Operates concurrently on many different data elements.

 Performs the same computation on all data elements.

 Processors operate synchronously.

 Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine) Parallel Architectures and Performance Analysis – Slide 37

[Diagram: conditional execution of the sequence X1; if (a = 0) then X2 else X3; X4. On an SISD machine the statements execute one after another. On an SIMD machine all PEs execute X1 and X4; during X2 only the PEs with a = 0 are active (the others idle), and during X3 only the PEs with a ≠ 0 are active.]

Parallel Architectures and Performance Analysis – Slide 38

 Multiple instruction stream, single data stream (MISD) computer  MISD machines may execute several different programs on the same data item.

 There are two categories   Distinct processing units perform distinct instructions on the same data. Currently there is no such machine.

Pipelined architectures, where data flows through a series of processing elements.

Parallel Architectures and Performance Analysis – Slide 39

[Diagram: MISD organization — control units 1 through n each issue their own instruction stream to processing elements 1 through n, all of which operate on a single shared data stream.]

Parallel Architectures and Performance Analysis – Slide 40

 A pipeline processor works according to the principle of pipelining.

   A process can be broken down into several stages (segments). While one stage is executing, another stage is being loaded and the input of one stage is the output of the previous stage.

The processor carries out many different computations concurrently.

 Example: systolic array Parallel Architectures and Performance Analysis – Slide 41

[Diagram: serial execution of two processes with 4 stages (S1–S4) each takes T = 8t, where t is the time to execute one stage; pipelined execution of the same two processes takes T = 5t.]

Parallel Architectures and Performance Analysis – Slide 42
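The timings in the diagram follow from the general pipeline formula, shown here as a short derivation (the general formula itself is not on the slide):

```latex
% n processes, m stages per process, t = time to execute one stage
T_{\mathrm{serial}} = n\,m\,t, \qquad T_{\mathrm{pipelined}} = (m + n - 1)\,t
% For the slide's example, m = 4 stages and n = 2 processes:
T_{\mathrm{serial}} = 2 \cdot 4\,t = 8t, \qquad T_{\mathrm{pipelined}} = (4 + 2 - 1)\,t = 5t
```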

 Multiple instruction stream, multiple data stream (MIMD) computer  General purpose multiprocessor system.

   Multiple processors, each with a separate (different) program operating on its own data.

One instruction stream is generated from each program for each processor.

Each instruction operates upon different data.

 Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.

Parallel Architectures and Performance Analysis – Slide 43

[Diagram: MIMD organization — control units 1 through n each issue their own instruction stream to processing elements 1 through n, each operating on its own data stream (Data Stream 1 through Data Stream n).]

Parallel Architectures and Performance Analysis – Slide 44

 Processing distributed over a number of processors operating independently and concurrently.

  Resources (memory) shared among processors.

Each processor runs its own program.

 MIMD systems execute operations in a parallel asynchronous fashion.

Parallel Architectures and Performance Analysis – Slide 45

 MIMD systems differ with regard to:
   Interconnection networks
   Memory addressing techniques
   Synchronization
   Control structures
 A high throughput can be achieved if the processing can be broken into parallel streams, keeping all the processors active concurrently.
Parallel Architectures and Performance Analysis – Slide 46

Parallel Architectures and Performance Analysis – Slide 46

 Multiple Program Multiple Data (MPMD) Structure  Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.

Parallel Architectures and Performance Analysis – Slide 47

 Single Program Multiple Data (SPMD) Structure  Single source program is written and each processor will execute its personal copy of this program, although independently and not in synchronism.

 The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.

 Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.

Parallel Architectures and Performance Analysis – Slide 48
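A minimal SPMD sketch, again assuming MPI as the message-passing layer: every process runs the same source program, and the branch taken depends on the identity of the process.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One source program; which part executes depends on the computer's identity. */
    if (rank == 0)
        printf("process 0 of %d: coordinating\n", size);     /* "master" part */
    else
        printf("process %d of %d: computing\n", rank, size); /* "worker" part */

    MPI_Finalize();
    return 0;
}
```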

 SIMD needs less hardware (only one control unit). In MIMD each processor has its own control unit.

 SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions). In MIMD the program and operating system need to be stored at each processor.

 SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD. Parallel Architectures and Performance Analysis – Slide 49

 MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism.

 For MIMD it is possible to use general-purpose microprocessors as processing units, which may be cheaper and more powerful.

Parallel Architectures and Performance Analysis – Slide 50

 Time to execute a sequence of instructions in which the execution time is data dependent is less for MIMD than for SIMD.

  MIMD allows each instruction to execute independently. In SIMD each processing element must wait until all the others have finished the execution of one instruction.

 Thus
T(MIMD) = max{ t_1 + t_2 + … + t_n }
T(SIMD) = max{ t_1 } + max{ t_2 } + … + max{ t_n }
 so T(MIMD) ≤ T(SIMD).
Parallel Architectures and Performance Analysis – Slide 51

 In MIMD each processing element can independently follow either branch when executing an if-then-else statement. This requires two phases on SIMD.

 MIMD can operate in SIMD mode.

Parallel Architectures and Performance Analysis – Slide 52

 Architectures:
   Vector computers
   Shared memory multiprocessors: tightly coupled
     Centralized/symmetrical multiprocessor (SMP): UMA
     Distributed multiprocessor: NUMA
   Distributed memory/message-passing multicomputers: loosely coupled
     Asymmetrical vs. symmetrical
 Flynn’s Taxonomy:
   SISD, SIMD, MISD, MIMD (MPMD, SPMD)
Parallel Architectures and Performance Analysis – Slide 53

 A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.

 The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.

Parallel Architectures and Performance Analysis – Slide 54

 The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture.

 If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively.

 Want to be able to do two things: predict performance of parallel programs, and understand barriers to higher performance.

Parallel Architectures and Performance Analysis – Slide 55

 General speedup formula
 Amdahl’s Law: decide if a program merits parallelization
 Gustafson-Barsis’ Law: evaluate the performance of a parallel program
Parallel Architectures and Performance Analysis – Slide 56

 The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.

 The speedup factor of a parallel computation utilizing p processors is defined as the ratio
S(p) = (execution time using one processor) / (execution time using p processors) = T_s / T_p.
 In other words, S(p) is the ratio of the sequential processing time to the parallel processing time.

Parallel Architectures and Performance Analysis – Slide 57

 Speedup factor can also be cast in terms of computational steps:
S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors).
 Maximum speedup is (usually) p with p processors (linear speedup).

Parallel Architectures and Performance Analysis – Slide 58

 It is assumed that the processor used in the parallel computation is identical to the one used by the sequential algorithm.

S(p) gives the increase in speed obtained by using a multiprocessor.

 The underlying algorithm for the parallel implementation might be (and usually is) different.

Parallel Architectures and Performance Analysis – Slide 59

 The sequential algorithm has to be the best algorithm known for a particular computation problem.

 This means that it is fair to judge the performance of parallel computation with respect to the fastest sequential algorithm for solving the same problem in a single processor architecture.

 Several issues such as synchronization and communication are involved in the parallel computation.

Parallel Architectures and Performance Analysis – Slide 60

 Given a problem of size n on p processors, let
   σ(n) = the inherently sequential computations,
   φ(n) = the potentially parallel computations,
   κ(n,p) = the communication operations.
 Then the speedup is bounded by
S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)).
Parallel Architectures and Performance Analysis – Slide 61

[Figure: computation time and communication time plotted against the number of processors — as p grows, the shrinking computation time is overtaken by the growing communication time, and the speedup curve “elbows out”.]

Parallel Architectures and Performance Analysis – Slide 62

 The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system:

E = (execution time using one processor) / (p × execution time using p processors) = T_s / (p T_p) = S(p) / p

 Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.

Parallel Architectures and Performance Analysis – Slide 63
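A trivial C helper that applies the two definitions above to measured timings; the example times and processor count are made up for illustration.

```c
#include <stdio.h>

/* S(p) = T_s / T_p;  E = S(p) / p */
static double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
static double efficiency(double t_serial, double t_parallel, int p) {
    return speedup(t_serial, t_parallel) / p;
}

int main(void) {
    double ts = 10.0, tp = 3.0;    /* example timings in seconds (made up) */
    int p = 4;
    printf("S(%d) = %.2f, E = %.2f\n", p, speedup(ts, tp), efficiency(ts, tp, p));
    return 0;
}
```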

 In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one.

 In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation.

 Every parallel program has overhead factors such as creating processes, process synchronization and communication.

 In practice efficiency is between zero and one, depending on the degree of effectiveness with which processing elements are utilized.

Parallel Architectures and Performance Analysis – Slide 64

 Since E = S(p)/p, by what we did earlier E = T_s / (p T_p).
   Since all terms are positive, E > 0.
   Furthermore, since the denominator is larger than the numerator, E < 1.
Parallel Architectures and Performance Analysis – Slide 65

 Consider the problem of adding n numbers on a p processor system.

 Initial brute-force approach: all tasks send their values to one processor, which adds them all up.

Parallel Architectures and Performance Analysis – Slide 66

 Parallel algorithm: find the global sum by using a binomial tree.

[Diagram: binomial tree combining the partial values into the global sum S.]

Parallel Architectures and Performance Analysis – Slide 67

 Assume it takes one unit of time for two directly connected processors to add two numbers and to communicate to each other.

 Adding n/p numbers locally on each processor takes n/p –1 units of time.

 The p partial sums may be added in log p steps, each consisting of one addition and one communication. Parallel Architectures and Performance Analysis – Slide 68
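A sketch of this algorithm as plain sequential C that simulates the p processors with an array of partial sums; the sizes `N` and `NP` and the helper array `partial` are assumptions for illustration, not part of the slides.

```c
#include <stdio.h>

#define N  64      /* numbers to add (assumed)       */
#define NP 8       /* simulated processors (assumed) */

int main(void) {
    int data[N], partial[NP];

    for (int i = 0; i < N; i++)
        data[i] = 1;                          /* sample input */

    /* Step 1: each "processor" adds its n/p numbers locally
       (n/p - 1 additions each, simulated here one processor at a time). */
    for (int p = 0; p < NP; p++) {
        partial[p] = 0;
        for (int i = p * (N / NP); i < (p + 1) * (N / NP); i++)
            partial[p] += data[i];
    }

    /* Step 2: combine the p partial sums along a binomial tree in log2(p)
       steps; at step s, processor p receives the sum held by processor p + s. */
    for (int s = 1; s < NP; s *= 2)
        for (int p = 0; p + s < NP; p += 2 * s)
            partial[p] += partial[p + s];

    printf("global sum = %d\n", partial[0]);  /* expect 64 */
    return 0;
}
```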

 The total parallel computation time is T_p = n/p − 1 + 2 log p.

 For large values of p and n this can be approximated by T_p = n/p + 2 log p.

 The serial computation time can be approximated by T_s = n.

Parallel Architectures and Performance Analysis – Slide 69

 The expression for speedup is

S(p) = T_s / T_p = n / (n/p + 2 log p) = np / (n + 2p log p)

 The expression for efficiency is

E = S(p) / p = n / (n + 2p log p)

 Speedup and efficiency can be calculated for any p and n.

Parallel Architectures and Performance Analysis – Slide 70

 Computational efficiency as a function of n and p:

                           n
      p        64      192      320      512
      1      1.000    1.000    1.000    1.000
      2       .980     .990     .995     .995
      4       .930     .975     .985     .990
      8       .815     .930     .956     .972
     16       .623     .832     .892     .930
     32       .399     .665     .768     .841

Parallel Architectures and Performance Analysis – Slide 71

[Figure: speedup plotted against the number of processors (up to about 30) for n = 64, 192, 320 and 512; larger problem sizes give higher speedup.]

Parallel Architectures and Performance Analysis – Slide 72

Parallel Architectures and Performance Analysis – Slide 73

 As before, S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p), since the communication time must be non-trivial.

 Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then

Parallel Architectures and Performance Analysis – Slide 74

 Then S(p) ≤ 1 / (f + (1 − f)/p) = p / (1 + (p − 1)f).
 In short, the maximum speedup factor is given by S(p) = p / (1 + (p − 1)f), where f is the fraction of the computation that cannot be divided into concurrent tasks.

Parallel Architectures and Performance Analysis – Slide 75

 Limitations:
   Ignores communication time
   Overestimates the speedup achievable
 Amdahl Effect:
   Typically κ(n,p) has lower complexity than φ(n)/p
   So as n increases, φ(n)/p dominates κ(n,p)
   Thus as n increases, speedup increases
Parallel Architectures and Performance Analysis – Slide 76

 Even with an infinite number of processors, the maximum speedup is limited to 1/f.

 Ex: With only 5% of a computation being serial, the maximum speedup is 20, irrespective of the number of processors.

Parallel Architectures and Performance Analysis – Slide 77

 So Amdahl’s Law:
   Treats problem size as a constant
   Shows how execution time decreases as the number of processors increases
 However, we often use faster computers to solve larger problem instances.
 Let’s treat time as a constant and allow the problem size to increase with the number of processors.
Parallel Architectures and Performance Analysis – Slide 78

 As before, S(p) = T_s / T_p.
 Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations, so the parallel time splits as T_p = s·T_p + (1 − s)·T_p; then
Parallel Architectures and Performance Analysis – Slide 79

 Then the same work would take T_s = s·T_p + p·(1 − s)·T_p on one processor, so the scaled speedup is
S(p) = T_s / T_p = s + p(1 − s) = p + (1 − p)s.
Parallel Architectures and Performance Analysis – Slide 80

 Begin with parallel execution time instead of sequential time  Estimate sequential execution time to solve same problem  Problem size is an increasing function of p  Predicts scaled speedup Parallel Architectures and Performance Analysis – Slide 81

 An application running on 10 processors spends 3% of its time in serial code.

 According to Amdahl’s Law the maximum speedup is 1 / (0.03 + 0.97/10) ≈ 7.87.
 However, the scaled speedup is 10 + (1 − 10)(0.03) = 9.73.
Parallel Architectures and Performance Analysis – Slide 82
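The two figures on this slide can be checked with a few lines of C; the functions below simply evaluate the two formulas given above for f = s = 0.03 and p = 10.

```c
#include <stdio.h>

/* Amdahl: speedup <= 1 / (f + (1 - f)/p), where f is the serial fraction
   of the sequential computation. */
static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

/* Gustafson-Barsis: scaled speedup = p + (1 - p) * s, where s is the serial
   fraction of the parallel execution time. */
static double gustafson(double s, int p) {
    return p + (1 - p) * s;
}

int main(void) {
    printf("Amdahl speedup: %.2f\n", amdahl(0.03, 10));     /* about 7.87 */
    printf("Scaled speedup: %.2f\n", gustafson(0.03, 10));  /* about 9.73 */
    return 0;
}
```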

 Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time.
 Both overestimate the speedup or scaled speedup achievable.
[Photos: Gene Amdahl and John L. Gustafson.]
Parallel Architectures and Performance Analysis – Slide 83

 Performance terms: speedup, efficiency
 Model of speedup: serial, parallel and communication components
 What prevents linear speedup?
   Serial and communication operations
   Process start-up
   Imbalanced workloads
   Architectural limitations
 Analyzing parallel performance:
   Amdahl’s Law
   Gustafson-Barsis’ Law
Parallel Architectures and Performance Analysis – Slide 84

 Based on original material from:
   The University of Akron: Tim O’Neil, Kathy Liszka
   Hiram College: Irena Lomonosov
   The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
   Oregon State University: Michael Quinn
 Revision history: last updated 7/28/2011.

Parallel Architectures and Performance Analysis – Slide 85