
High Performance Computing: Concepts, Methods & Means
Parallel Algorithms 1
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 1st, 2007
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
2
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
3
“embarrassingly parallel”
• Common phrase
  – poorly defined
  – widely used
• Suggests lots and lots of parallelism
  – with essentially no inter-task communication or coordination
  – Highly partitionable workload with minimal overhead
• “almost embarrassingly parallel”
  – Same as above, but
  – Requires master to launch many tasks
  – Requires master to collect final results of tasks
  – Sometimes still referred to as “embarrassingly parallel”
4
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional for non-embarrassingly parallel (cooperative)
• Detect “stop” condition
  – Maybe implicit with a barrier etc.
• Aggregate final results
  – Often a reduction operator
• Output results and error code
• Terminate and return to OS
(A minimal MPI sketch of these steps follows this slide.)
5
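The steps above map almost one-for-one onto MPI calls. The following is a minimal sketch (an editor's illustration, not course code) of an almost-embarrassingly-parallel program; the work split and the per-rank "computation" are placeholders.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, i;
    double local = 0.0, global = 0.0;

    /* Initialize application execution environment */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Distribute work: each rank takes every size-th element (placeholder) */
    for (i = rank; i < 1000; i += size)
        local += (double) i;                /* core computation in parallel */

    /* Aggregate final results with a reduction operator */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);       /* output results */

    /* Terminate and return to OS */
    MPI_Finalize();
    return 0;
}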
Parallel Programming
• Goals
  – Correctness
  – Reduction in execution time
  – Efficiency
  – Scalability
  – Increased problem size and richness of models
• Objectives
  – Expose parallelism
    • Algorithm design
  – Distribute work uniformly
    • Data decomposition and allocation
    • Dynamic load balancing
  – Minimize overhead of synchronization and communication
    • Coarse granularity
    • Big messages
  – Minimize redundant work
    • Still sometimes better than communication
6
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
7
Array Decomposition
• Simple MPI Example
• Master-Worker Data Partitioning and Distribution
– Array decomposition
– Uniformly distributes parts of array among workers
• (and master)
– A kind of static load balancing
• Assumes equal work on equal data set sizes
• Demonstrates
– Data partitioning
– Data distribution
– Coarse grain parallel execution
• No communication between tasks
– Reduction operator
– Master-worker control model
8
Array Decomposition (LLNL Web)
Demonstrate simple data decomposition:
– Master initializes the array and then distributes an equal portion of the array among the other tasks.
– The other tasks receive their portion of the array and perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c
9
Array Decomposition
• Array Processing: A Parallel Solution
• Array elements are distributed so that each processor owns a portion of the array (a subarray).
• Independent calculation of array elements ensures there is no need for communication between tasks.
• The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
• Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language. See the Block-Cyclic Distribution Diagram for the options.
• After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution:
do j = mystart, myend
  do i = 1, n
    a(i,j) = fcn(i,j)
  end do
end do
• Notice that only the outer loop variables differ from the serial solution. (A C version of this block distribution is sketched after this slide.)
10
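For comparison, a C version of the same block distribution would partition the outer (row) loop, since C arrays are row-major; this keeps unit stride within each row. A minimal sketch, with fcn and the bounds mystart/myend assumed to be supplied as in the slide:

/* C analogue of the Fortran block loop above: each task fills its own
   block of rows [mystart..myend] of an n-column array with unit stride. */
void fill_block(int mystart, int myend, int n, double a[][n],
                double (*fcn)(int, int))
{
    int i, j;
    for (i = mystart; i <= myend; i++)   /* block of rows owned by this task */
        for (j = 0; j < n; j++)          /* unit stride in row-major C */
            a[i][j] = fcn(i, j);
}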
Array Decomposition Locality
• Maximize locality
  – Spatial locality
    • Variable likely to be used if neighbor data is used
    • Exploits unit or uniform stride access patterns
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
  – Temporal locality
    • Variable likely to be reused if already recently used
    • Exploits cache loads and LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages
11
Array Decomposition Layout
• Dimensions
  – 1 dimension: linear
  – 2 dimensions: “2-D”
  – 3 dimensions
  – Impacts surface-to-volume ratio for inter-process communications
• Distribution (see the ownership sketch after this slide)
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
12
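The block/cyclic choice can be summarized in one line each. A sketch (names of my choosing) of which rank owns global element i, given P ranks and N elements:

/* Which rank owns global element i? (sketch) */
int block_owner(int i, int N, int P) { return i / ((N + P - 1) / P); } /* contiguous chunks */
int cyclic_owner(int i, int P)       { return i % P; }                 /* round-robin dealing */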
Array Decomposition
[Figure: an array partitioned into equal chunks; the sum is accumulated from each part to give the sum of the Complete Array.]
13
Flowchart for Array Decomposition
[Flowchart]
“master”: Initialize MPI Environment → Initialize Array → Partition Array into workloads → Send Workload to “workers” → Calculate Sum for array chunk → Recv. results → Reduction Operator to sum up results → Print results → End
“workers” (each): Initialize MPI Environment → Recv. work → Calculate Sum for array chunk → Send Sum
14
Array Decomposition (source code)
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 16000000   /** Matrix Size 4000x4000 **/
#define MASTER 0
float data[ARRAYSIZE];
int main(int argc, char *argv[])
{
int numtasks, taskid, rc, dest, offset, i, j, tag1,
    tag2, source, chunksize;
float mysum, sum;
float update(int myoffset, int chunk, int myid);
MPI_Status status;
/***** Initializing the MPI Environment *****/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks % 4 != 0) {
   printf("Quitting. Number of MPI tasks must be divisible by 4.\n"); /** For equal distribution of workload **/
   MPI_Abort(MPI_COMM_WORLD, 1);   /* rc was uninitialized here in the original sample; use a constant error code */
   exit(0);
}
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
printf ("MPI task %d has started...\n", taskid);
chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;
tag1 = 2;
15
Array Decomposition (source code)
/***** Master : Initializes the array and fills it with dummy values******/
if (taskid == MASTER){
/* Initialize the array */
sum = 0;
for(i=0; i<ARRAYSIZE; i++) {
data[i] = i * 1.0;
sum = sum + data[i];
}
printf("Initialized array sum = %e\n",sum);
/* Partition & Send each task its portion of the array - master keeps 1st part */
offset = chunksize;
for (dest=1; dest<numtasks; dest++) {
MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
printf("Sent %d elements to task %d offset= %d\n",chunksize,dest,offset);
offset = offset + chunksize;
}
/* Master does its part of the work */
offset = 0;
mysum = update(offset, chunksize, taskid);
/* Wait to receive results from each task */
for (i=1; i<numtasks; i++) {
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
}
16
Array Decomposition (source code)
/* Get final sum and print sample results */
MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
printf("Sample results: \n");
offset = 0;
for (i=0; i<numtasks; i++) {
for (j=0; j<5; j++)
printf(" %e",data[offset+j]);
printf("\n");
offset = offset + chunksize;
}
printf("*** Final sum= %e ***\n",sum);
} /* end of master section */
/***** Non-master tasks only *****/
if (taskid > MASTER) {
/* Receive my portion of array from the master task */
source = MASTER;
MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
mysum = update(offset, chunksize, taskid);
/* Send my results back to the master task */
dest = MASTER;
MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);
MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
} /* end of non-master */
17
Array Decomposition (source code)
MPI_Finalize();
} /* end of main */
/**calculates the sum of the array chunk**/
float update(int myoffset, int chunk, int myid) {
int i;
float mysum;
/* Perform addition to each of my array elements and keep my sum */
mysum = 0;
for(i=myoffset; i < myoffset + chunk; i++) {
data[i] = data[i] + i * 1.0;
mysum = mysum + data[i];
}
printf("Task %d mysum = %e\n",myid,mysum);
return(mysum);
}
18
Demo : Array Decomposition
Output from Celeritas for a 4 processor run.
/home/cdekate/pa/array_decomposition
Wed Feb 28 09:08:52 CST 2007
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***
Wed Feb 28 09:08:53 CST 2007
19
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
20
[Slides 21–27: figures only in the original deck (Mandelbrot set material, including the load-balancing slides 25–26 referenced in the summary); no text was captured in this transcript.]
Flowchart for Mandelbrot Set Generation
[Flowchart]
All tasks (“master” and “workers”): Initialize MPI Environment → Create Local Workload buffer → Isolate work regions → Calculate Mandelbrot set values across work region
“master”: Write result from task 0 to file → Recv. results from “workers” → Concatenate results to file → End
“workers” (each): Send result to “master”
28
Mandelbrot Sets (source code)
#include<stdio.h>
#include<assert.h>
#include<stdlib.h>
#include<mpi.h>
typedef struct complex{
double real;
double imag;
} Complex;
/* Calculate Pixel Values */
int cal_pixel(Complex c){
int count, max_iter;
Complex z;
double temp, lengthsq;
max_iter = 256;
z.real = 0;
z.imag = 0;
count = 0;
/* Core Mandelbrot Algorithm */
do{
temp = z.real * z.real - z.imag * z.imag + c.real;
z.imag = 2 * z.real * z.imag + c.imag;
z.real = temp;
lengthsq = z.real * z.real + z.imag * z.imag;
count ++;
}
while ((lengthsq < 4.0) && (count < max_iter));
return(count);
}
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
29
Mandelbrot Sets (source code)
#define MASTERPE 0
int main(int argc, char **argv){
FILE *file;
int i, j;
Complex c;
int tmp;
double *data_l, *data_l_tmp;
int nx, ny;
int mystrt, myend;
int nrows_l;
int nprocs, mype;
MPI_Status status;
/***** Initializing MPI Environment*****/
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &mype);
/***** Pass in the dimension (X,Y) of the area to cover *****/
if (argc != 3){
int err = 0;
printf("argc %d\n", argc);
if (mype == MASTERPE){
printf("usage: mandelbrot nx ny");
MPI_Abort(MPI_COMM_WORLD,err );
}
}
/* get command line args */
nx = atoi(argv[1]);
ny = atoi(argv[2]);
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
30
Mandelbrot Sets (source code)
/* assume divides equally */ /***** Mark work region across processors*****/
nrows_l = nx/nprocs;
/* create buffer for local work only */
data_l = (double *) malloc(nrows_l * ny * sizeof(double));
data_l_tmp = data_l;
/* calculate each processor's region of work */
/***** Distribute region across processors*****/
/***** All processors including the “master” and “workers”
perform the following*****/
mystrt = mype*nrows_l;
myend = mystrt + nrows_l - 1;
/* calc each procs coordinates and call local mandelbrot set function */
for (i = mystrt; i <= myend; ++i){
c.real = i/((double) nx) * 4. - 2. ;
for (j = 0; j < ny; ++j){
c.imag = j/((double) ny) * 4. - 2. ;
tmp = cal_pixel(c);
/***** Calculate value for each coordinate point*****/
*data_l++ = (double) tmp;
}
}
data_l = data_l_tmp;
/***** If “master” then begin writing output and then wait for
“workers” to send their result*****/
if (mype == MASTERPE){
file = fopen("mandelbrot.bin_0000", "w");
printf("nrows_l, ny %d %d\n", nrows_l, ny);
fwrite(data_l, nrows_l*ny, sizeof(double), file);
fclose(file);
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
31
Mandelbrot Sets (source code)
for (i = 1; i < nprocs; ++i){
MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
printf("received message from proc %d\n", i);
file = fopen("mandelbrot.bin_0000", "a");
fwrite(data_l, nrows_l*ny, sizeof(double), file);
fclose(file);
}
}
/***** If “worker” send the calculated pixel array to the the “master”*****/
else{
MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
}
Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
32
Demo : Mandelbrot Sets
33
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
34
[Slides 35–37: figures only in the original deck (introductory material for the Monte Carlo section); no text was captured in this transcript.]
Monte Carlo
• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
• Inscribe a circle in a square
• Randomly generate points in the square
• Determine the number of points in the square that are also in the circle
• Let r be the number of points in the circle divided by the number of points in the square
• PI ~ 4 r
  (A circle of radius R has area πR², while the enclosing square has area 4R², so the fraction of points landing in the circle approaches π/4.)
• Note that the more points generated, the better the approximation
(A serial C transcription of the algorithm follows this slide.)
Algorithm :
npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1 ; ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle
  then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
38
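The pseudocode above transcribes almost directly into serial C. A sketch (an editor's illustration using the standard rand(); not the course's MPI version, which follows):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int npoints = 10000, circle_count = 0, j;
    double x, y;

    srand(12345);                         /* fixed seed for reproducibility */
    for (j = 0; j < npoints; j++) {
        x = (double) rand() / RAND_MAX;   /* random point in the unit square */
        y = (double) rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)         /* inside the (quarter) circle? */
            circle_count++;
    }
    printf("PI ~ %f\n", 4.0 * circle_count / npoints);
    return 0;
}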
Monte Carlo Pi Computation
[Flowchart]
Master: Initialize MPI Environment → Broadcast Error Bound → loop: Send Request to Server → Receive Random Array → Perform Computations → Propagate Number of Points (Allreduce) → Output Partial Result → until Stop Condition Satisfied → Print Statistics → Finalize MPI
Worker: Initialize MPI Environment → Receive Error Bound → loop: Send Request to Server → Receive Random Array → Perform Computations → Propagate Number of Points (Allreduce) → until Stop Condition Satisfied → Finalize MPI
Server: Initialize MPI Environment → loop: Receive Request → Compute Random Array → Send Array to Requestor → until Last Request → Finalize MPI
39
Creating Custom Communicators
• Communicators define groups and the access patterns among them
• Default communicator is MPI_COMM_WORLD
• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators
• MPI permits creation of custom communicators
• MPI_Comm_create (see the sketch after this slide)
40
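A minimal sketch of how the π example below builds its custom communicator: take the group of MPI_COMM_WORLD, exclude the server rank, and create a communicator over the remaining "workers". The function name and structure here are my own:

#include "mpi.h"

/* Sketch: create a "workers" communicator that excludes the last rank. */
MPI_Comm make_workers_comm(void)
{
    MPI_Group world_group, worker_group;
    MPI_Comm workers;
    int size, excluded[1];

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);  /* group of all ranks */
    excluded[0] = size - 1;                        /* drop the server rank */
    MPI_Group_excl(world_group, 1, excluded, &worker_group);
    MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);
    MPI_Group_free(&worker_group);
    /* Note: on the excluded rank, workers == MPI_COMM_NULL. */
    return workers;
}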
Monte Carlo : Pi (source code)
#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
/* We'd like a value that gives the maximum value returned by the function
   random, but no such value is *portable*. RAND_MAX is available on many
   systems but is not always the correct value for random (it isn't for
   Solaris). The value ((unsigned(1)<<31)-1) is common but not guaranteed */
#define INT_MAX 1000000000
/* message tags */
#define REQUEST 1
#define REPLY 2
int main( int argc, char *argv[] )
{
int iter;
int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
double x, y, Pi, error, epsilon;
int numprocs, myid, server, totalin, totalout, workerid;
int rands[CHUNKSIZE], request;
MPI_Comm world, workers;
MPI_Group world_group, worker_group;
MPI_Status status;
/* Initialize MPI Environment */
MPI_Init(&argc,&argv);
world = MPI_COMM_WORLD;
MPI_Comm_size(world,&numprocs);
MPI_Comm_rank(world,&myid);
41
Monte Carlo : Pi (source code)
server = numprocs-1;
/* last proc is server */
if (myid == 0)
sscanf( argv[1], "%lf", &epsilon );
/* Broadcast Error Bounds */
MPI_Bcast( &epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );
MPI_Comm_group( world, &world_group );
ranks[0] = server;
MPI_Group_excl( world_group, 1, ranks, &worker_group );
MPI_Comm_create( world, worker_group, &workers ); /* Creating a custom communicator */
MPI_Group_free(&worker_group);
if (myid == server) {
do {
/* Server Block */
/* I am the rand server */
/* Recv. Request for random number array*/
MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
if (request) {
/* Computing the random array */
for (i = 0; i < CHUNKSIZE; ) {
rands[i] = random();
if (rands[i] <= INT_MAX) i++;
}
/* Send random number array*/
MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
}
}
while( request>0 );
}
/* End Server Block */
else {
/* Begin Worker Block */
request = 1;
done = in = out = 0;
max = INT_MAX;
/* max int, for normalization */
MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
MPI_Comm_rank( workers, &workerid );
iter = 0;
42
Monte Carlo : Pi (source code)
while (!done) {
iter++;
request = 1;
/* Recv. random array from server*/
MPI_Recv( rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status );
/* Derive random x,y coordinates for each pair(x,y) of items in the Random Array */
for (i=0; i<CHUNKSIZE-1; ) {
x = (((double) rands[i++])/max) * 2 - 1;
y = (((double) rands[i++])/max) * 2 - 1;
if (x*x + y*y < 1.0)
in++;
else
out++;
}
/* Perform reduction of the in/out counts across all workers (Allreduce also broadcasts the totals) */
MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
Pi = (4.0*totalin)/(totalin + totalout);
/* Compute PI*/
error = fabs( Pi-3.141592653589793238462643);
/* Calculate Error */
done = (error < epsilon || (totalin+totalout) > 1000000);
request = (done) ? 0 : 1;
if (myid == 0) {
/* If “Master” : Print current value of PI */
printf( "\rpi = %23.20f", Pi );
MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
}
else {
/* If “Worker” : Request new array if not finished */
if (request)
MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
}
}
MPI_Comm_free(&workers);
}
43
Monte Carlo : Pi (source code)
if (myid == 0) {
/* If “Master” : Print Results */
printf( "\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
totalin+totalout, totalin, totalout );
getchar();
}
MPI_Finalize();
}
44
Demo : Monte Carlo Pi
> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696
45
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
46
Matrix Multiplication (LLNL Web)
• MPI Example : MPI Matrix Multiply - C Version
• In this code, the master task distributes a matrix multiply
operation to numtasks-1 worker tasks.
• NOTE: C and Fortran versions of this code differ because of the
way arrays are stored/passed. C arrays are row-major order but
Fortran arrays are column-major order.
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
47
Matrices — A Review
An n x m matrix
[Figure: an n x m matrix, with n rows and m columns.]
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
48
Matrix Addition
Involves adding corresponding elements of each matrix to form the result matrix.
Given the elements of A as ai,j and the elements of B as bi,j, each element of C is computed as
ci,j = ai,j + bi,j
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
49
Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements, ci,j (0 <= i < n, 0 <= j < m), are computed as follows:
ci,j = Σ (k = 0 to l-1) ai,k · bk,j
where A is an n x l matrix and B is an l x m matrix.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
50
Matrix multiplication, C = A x B
[Figure: element ci,j computed from row i of A and column j of B.]
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
51
Matrix-Vector Multiplication
c = A x b
Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n x 1 matrix (vector). The result is an n x 1 matrix (vector). (A serial C sketch follows this slide.)
52
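As a sketch of how the definition specializes, here is a serial C matrix-vector product (all names are my own; assumes a square n x n matrix):

/* c = A x b for an n x n matrix A and n-vector b (sketch) */
void matvec(int n, double a[n][n], double b[n], double c[n])
{
    int i, k;
    for (i = 0; i < n; i++) {
        c[i] = 0.0;
        for (k = 0; k < n; k++)
            c[i] += a[i][k] * b[k];   /* dot product of row i with b */
    }
}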
Relationship of Matrices to Linear Equations
A system of linear equations can be written in matrix form:
Ax = b
Matrix A holds the a (coefficient) constants
x is a vector of the unknowns
b is a vector of the b constants.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
53
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++) {
    c[i][j] = 0;
    for (k = 0; k < n; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }
This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3).
Very easy to parallelize.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
54
Partitioning into Submatrices
Suppose the matrix is divided into s^2 submatrices. Each submatrix has n/s x n/s elements. Using the notation Ap,q for the submatrix in submatrix row p and submatrix column q:
for (p = 0; p < s; p++)
  for (q = 0; q < s; q++) {
    Cp,q = 0;                      /* clear elements of submatrix */
    for (r = 0; r < s; r++)        /* submatrix multiplication and */
      Cp,q = Cp,q + Ap,r * Br,q;   /* add to accumulating submatrix */
  }
The line
Cp,q = Cp,q + Ap,r * Br,q;
means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition. Known as block matrix multiplication. (A concrete C sketch follows this slide.)
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
55
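A serial C sketch of the block matrix multiplication described above, assuming for simplicity that the block size bs divides n evenly (all names are my own):

/* Block (submatrix) matrix multiplication sketch: C = A x B for n x n
   matrices, using bs x bs blocks (assume bs divides n). */
void block_matmul(int n, int bs, double a[n][n], double b[n][n], double c[n][n])
{
    int p, q, r, i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;              /* clear C */
    for (p = 0; p < n; p += bs)         /* submatrix row of C */
        for (q = 0; q < n; q += bs)     /* submatrix column of C */
            for (r = 0; r < n; r += bs) /* Cp,q = Cp,q + Ap,r * Br,q */
                for (i = p; i < p + bs; i++)
                    for (k = r; k < r + bs; k++)
                        for (j = q; j < q + bs; j++)
                            c[i][j] += a[i][k] * b[k][j];
}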
Block Matrix Multiplication
[Figure: multiplying two matrices partitioned into submatrices, block by block.]
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
56
[Slide 57: figure only in the original deck.]
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
57
Direct Implementation
One processor to compute each element of C – n^2 processors would be needed. One row of elements of A and one column of elements of B needed. Some of the same elements sent to more than one processor. Can use submatrices.
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
58
Performance Improvement
Using tree construction, n numbers can be added in log n steps using n processors:
Computational time complexity of O(log n) using n^3 processors. (A sketch of the tree summation follows this slide.)
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
59
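The tree summation can be sketched as follows. In a parallel version, each round's additions would run concurrently on separate processors, giving the log n steps claimed above; this serial illustration (names my own) just shows the schedule:

/* Tree (pairwise) summation sketch: ceil(log2(n)) rounds over n elements. */
double tree_sum(double x[], int n)
{
    int stride, i;
    for (stride = 1; stride < n; stride *= 2)       /* one round per tree level */
        for (i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                  /* pairwise combine */
    return x[0];
}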
Flowchart for Matrix Multiplication
[Flowchart]
“master”: Initialize MPI Environment → Initialize Array → Partition Array into workloads → Send Workload to “workers” → wait for “workers” to finish task → Recv. results → Print results → End
“workers” (each): Initialize MPI Environment → Recv. work → Calculate matrix product → Send result
60
Matrix Multiplication (source code)
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define NRA 62                /* number of rows in matrix A */
#define NCA 15                /* number of columns in matrix A */
#define NCB 7                 /* number of columns in matrix B */
#define MASTER 0              /* taskid of first task */
#define FROM_MASTER 1         /* setting a message type */
#define FROM_WORKER 2         /* setting a message type */
int main(int argc, char *argv[])
{
int numtasks,                 /* number of tasks in partition */
    taskid,                   /* a task identifier */
    numworkers,               /* number of worker tasks */
    source,                   /* task id of message source */
    dest,                     /* task id of message destination */
    mtype,                    /* message type */
    rows,                     /* rows of matrix A sent to each worker */
    averow, extra, offset,    /* used to determine rows sent to each worker */
    i, j, k, rc;              /* misc */
double a[NRA][NCA],           /* matrix A to be multiplied */
       b[NCA][NCB],           /* matrix B to be multiplied */
       c[NRA][NCB];           /* result matrix C */
MPI_Status status;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
61
Matrix Multiplication (source code)
/* Initialize MPI Environment */
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
if (numtasks < 2 ) {
printf("Need at least two MPI tasks. Quitting...\n");
MPI_Abort(MPI_COMM_WORLD, 1);   /* rc was uninitialized here in the original sample; use a constant error code */
exit(1);
}
numworkers = numtasks-1;
/* Master block*/
if (taskid == MASTER)
{
printf("mpi_mm has started with %d tasks.\n",numtasks);
printf("Initializing arrays...\n");
for (i=0; i<NRA; i++)
for (j=0; j<NCA; j++)
a[i][j]= i+j;
/* Initialize array a */
for (i=0; i<NCA; i++)
for (j=0; j<NCB; j++)
b[i][j]= i*j;
/* Initialize array b */
/* Send matrix data to the worker tasks */
averow = NRA/numworkers; /* determining fraction of array to be processed by “workers” */
extra = NRA%numworkers;
offset = 0;
mtype = FROM_MASTER; /* Message Tag */
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
62
Matrix Multiplication (source code)
for (dest=1; dest<=numworkers; dest++)
{
/* To each worker send : Start point, number of rows to process, and sub-arrays to process */
rows = (dest <= extra) ? averow+1 : averow;
printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset);
MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
/* Receive results from worker tasks */
mtype = FROM_WORKER; /* Message tag for messages sent by “workers” */
for (i=1; i<=numworkers; i++)
{
source = i;
/* offset stores the (processing) starting point of work chunk */
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
/* The array C contains the product of sub-array A and the array B */
MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
printf("Received results from task %d\n",source);
}
printf("******************************************************\n");
printf("Result Matrix:\n");
for (i=0; i<NRA; i++)
{
printf("\n");
for (j=0; j<NCB; j++)
printf("%6.2f ", c[i][j]);
}
printf("\n******************************************************\n");
printf ("Done.\n");
}
63
Matrix Multiplication (source code)
/**************************** worker task ************************************/
if (taskid > MASTER)
{
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for (j=0; j<NCA; j++)
/* Calculate the product and store result in C */
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
/* Worker sends the resultant array to the master */
MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
64
Demo : Matrix Multiplication
[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Sending 21 rows to task 1 offset=0
Sending 21 rows to task 2 offset=21
Sending 20 rows to task 3 offset=42
Received results from task 1
Received results from task 2
Received results from task 3
******************************************************
Result Matrix:
0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00
0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00
0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00
0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00
0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00
0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00
0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00
…
…
…
0.00 6475.00 12950.00 19425.00 25900.00 32375.00 38850.00
0.00 6580.00 13160.00 19740.00 26320.00 32900.00 39480.00
0.00 6685.00 13370.00 20055.00 26740.00 33425.00 40110.00
0.00 6790.00 13580.00 20370.00 27160.00 33950.00 40740.00
0.00 6895.00 13790.00 20685.00 27580.00 34475.00 41370.00
0.00 7000.00 14000.00 21000.00 28000.00 35000.00 42000.00
0.00 7105.00 14210.00 21315.00 28420.00 35525.00 42630.00
0.00 7210.00 14420.00 21630.00 28840.00 36050.00 43260.00
0.00 7315.00 14630.00 21945.00 29260.00 36575.00 43890.00
0.00 7420.00 14840.00 22260.00 29680.00 37100.00 44520.00
******************************************************
Done.
[cdekate@compute-0-6 matrix_multiplication]$
65
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
66
N Bodies
OU Supercomputing Center for Education & Research
67
N-Body Problems
An N-body problem is a problem involving
N “bodies” – that is, particles (e.g., stars,
atoms) – each of which applies a force to all of
the others.
For example, if you have N stars, then each of
the N stars exerts a force (gravity) on all of the
other N–1 stars.
Likewise, if you have N atoms, then every atom
exerts a force (nuclear) on all of the other N–1
atoms.
OU Supercomputing Center for Education & Research
68
1-Body Problem
When N is 1, you have a simple 1-Body Problem: a single particle, with no forces acting on it.
Given the particle’s position P and velocity V at some time t0, you can trivially calculate the particle’s position at time t0+Δt:
P(t0+Δt) = P(t0) + VΔt
V(t0+Δt) = V(t0)
OU Supercomputing Center for Education & Research
69
2-Body Problem
When N is 2, you have – surprise! – a 2-Body Problem:
exactly two particles, each exerting a force that acts
on the other.
The relationship between the 2 particles can be
expressed as a differential equation that can be
solved analytically, producing a closed-form solution.
So, given the particles’ initial positions and velocities,
you can immediately calculate their positions and
velocities at any later time.
OU Supercomputing Center for Education & Research
70
N-Body Problems
For N of 3 or more, no one knows how to solve the
equations to get a closed form solution.
So, numerical simulation is pretty much the only way to
study groups of 3 or more bodies.
Popular applications of N-body codes include astronomy
and chemistry.
Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).
OU Supercomputing Center for Education & Research
71
N Bodies
OU Supercomputing Center for Education & Research
72
N-Body Problems
Given N bodies, each body exerts a force on all of the
other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
You can also think of this as (N • (N–1))/2 forces, in the
sense that the force from particle A to particle B is the
same (except in the opposite direction) as the force
from particle B to particle A.
OU Supercomputing Center for Education & Research
73
Aside: Big-O Notation
Let’s say that you have some task to perform on a
certain number of things, and that the task takes a
certain amount of time to complete.
Let’s say that the amount of time can be expressed as a
polynomial on the number of things to perform the task
on.
For example, the amount of time it takes to read a book
might be proportional to the number of words, plus the
amount of time it takes to sit in your favorite easy
chair.
C1·N + C2
OU Supercomputing Center for Education & Research
74
Big-O: Dropping the Low Term
C1·N + C2
When N is very large, the time spent settling into your easy chair
becomes such a small proportion of the total time that it’s
virtually zero.
So from a practical perspective, for large N, the polynomial reduces
to:
C1·N
In fact, for any polynomial, all of the terms except the highest-order
term are irrelevant, for large N.
OU Supercomputing Center for Education & Research
75
Big-O: Dropping the Constant
C1·N
Computers get faster and faster all the time. And there
are many different flavors of computers, having many
different speeds.
So, computer scientists don’t care about the constant,
only about the order of the highest-order term of the
polynomial.
They indicate this with Big-O notation:
O(N)
This is often said as: “of order N.”
OU Supercomputing Center for Education & Research
76
N-Body Problems
Given N bodies, each body exerts a force on all of the
other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
In Big-O notation, that’s O(N^2) forces to calculate.
So, calculating the forces takes O(N^2) time to execute.
But, there are only N particles, each taking up the same
amount of memory, so we say that N-body codes are
of:
• O(N) spatial complexity (memory)
• O(N^2) time complexity (a pair-enumeration sketch follows this slide)
OU Supercomputing Center for Education & Research
77
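To see the N(N-1)/2 pair count concretely, here is a small sketch (my own illustration) that enumerates each unordered pair of bodies exactly once:

#include <stdio.h>

int main(void)
{
    int n = 1000, i, j;
    long pairs = 0;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)   /* visit each unordered pair once */
            pairs++;                  /* one interaction per pair */
    printf("%d bodies -> %ld interactions (= N(N-1)/2)\n", n, pairs);
    return 0;
}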
O(N^2) Forces
[Figure: particle A and the forces between A and every other particle.]
Note that this picture shows only the forces between A and everyone else.
OU Supercomputing Center for Education & Research
78
How to Calculate?
Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.
For example, for gravity,
F(A,B) = G · mA · mB / dist(A,B)^2
where G is the gravitational constant and m is the mass of the particle in question.
If you have all of the forces for every pair of particles,
then you can calculate their sum, obtaining the force
on every particle.
OU Supercomputing Center for Education & Research
79
How to Parallelize?
Okay, so let’s say you have a nice serial (single-CPU)
code that does an N-body calculation.
How are you going to parallelize it?
You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the
particles, and then share around the forces;
• have each process decide its own subset of the
interactions, and then share around the forces.
OU Supercomputing Center for Education & Research
80
Do You Need a Master?
Let’s say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don’t need to calculate both A → B and B → A).
Do you need a master?
Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?
If the answer is yes, then you don’t need a master.
OU Supercomputing Center for Education & Research
81
Parallelize How?
Suppose you have P processors.
Should you parallelize:
• by assigning a subset of N/P of the bodies to each
processor, or
• by assigning a subset of ½N(N-1)/P of the
interactions to each processor?
OU Supercomputing Center for Education & Research
82
N-Body “Pipeline” Implementation Flowchart
[Flowchart]
Initialize MPI environment → Create ring communicator → Initialize particle parameters → Copy local particle data to send buffer →
pipeline loop: Initiate transmission of send buffer to the RIGHT neighbor in ring; Initiate reception of data from the LEFT neighbor in ring; Compute forces between local and send buffer particles; Wait for message exchange to complete; Copy particle data from receive buffer to send buffer — repeat until particles from all remote nodes have been processed →
Update positions of local particles → repeat until all iterations done → Finalize MPI
83
N-Body (source code)
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
/* Pipeline version of the algorithm... */
/* we really need the velocities as well… */
/* Simplified structure describing parameters of a single particle */
typedef struct {
double x, y, z;
double mass;
} Particle;
/* We use leapfrog for the time integration ... */
/* Structure to hold force components and old position coordinates of a particle */
typedef struct {
double xold, yold, zold;
double fx, fy, fz;
} ParticleV;
void InitParticles( Particle[], ParticleV [], int );
double ComputeForces( Particle [], Particle [], ParticleV [], int );
double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );
#define MAX_PARTICLES 4000
#define MAX_P 128
84
N-Body (source code)
int main( int argc, char *argv[] )
{
Particle particles[MAX_PARTICLES];   /* Particles on ALL nodes */
ParticleV pv[MAX_PARTICLES];         /* Particle velocity */
Particle sendbuf[MAX_PARTICLES],     /* Pipeline buffers */
         recvbuf[MAX_PARTICLES];
MPI_Request request[2];
int counts[MAX_P],                   /* Number on each processor */
    displs[MAX_P];                   /* Offsets into particles */
int rank, size, npart, i, j,
    offset;                          /* location of local particles */
int totpart,                         /* total number of particles */
    cnt;                             /* number of times in loop */
MPI_Datatype particletype;
double sim_t;                        /* Simulation time */
double time;                         /* Computation time */
int pipe, left, right, periodic;
MPI_Comm commring;
MPI_Status statuses[2];
/* Initialize MPI Environment */
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
/* Create 1-dimensional periodic Cartesian communicator (a ring) */
periodic = 1;
MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring );
85
N-Body (source code)
MPI_Cart_shift( commring, 0, 1, &left, &right ); /* Find the closest neighbors in ring */
/* Calculate local fraction of particles */
if (argc < 2) {
fprintf( stderr, "Usage: %s n\n", argv[0] );
MPI_Abort( MPI_COMM_WORLD, 1 );
}
npart = atoi(argv[1]) / size;
if (npart * size > MAX_PARTICLES) {
fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );
MPI_Abort( MPI_COMM_WORLD, 1 );
}
MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype ); /* Data type corresponding to Particle struct */
MPI_Type_commit( &particletype );
/* Get the sizes and displacements */
MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring );
displs[0] = 0;
for (i=1; i<size; i++)
displs[i] = displs[i-1] + counts[i-1];
totpart = displs[size-1] + counts[size-1];
/* Generate the initial values */
InitParticles( particles, pv, npart);
offset = displs[rank];
cnt = 10;
time = MPI_Wtime();
sim_t = 0.0;
/* Begin simulation loop */
while (cnt--) {
double max_f, max_f_seg;
86
N-Body (source code)
/* Load the initial send buffer */
memcpy( sendbuf, particles, npart * sizeof(Particle) );
max_f = 0.0;
for (pipe=0; pipe<size; pipe++) {
if (pipe != size-1) {
/* Initialize send to the “right” neighbor, while receiving from the “left” */
MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
}
/* Compute forces */
max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
if (max_f_seg > max_f) max_f = max_f_seg;
/* Wait for updates to complete and copy received particles to the send buffer */
if (pipe != size-1) MPI_Waitall( 2, request, statuses );
memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
}
/* Compute the changes in position using the already calculated forces */
sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );
/* We could do graphics here (move particles on the display) */
}
time = MPI_Wtime() - time;
if (rank == 0) {
printf( "Computed %d particles in %f seconds\n", totpart, time );
}
MPI_Finalize();
return 0;
}
87
N-Body (source code)
/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
int i;
for (i=0; i<npart; i++) {
particles[i].x = drand48();
particles[i].y = drand48();
particles[i].z = drand48();
particles[i].mass = 1.0;
pv[i].xold = particles[i].x;
pv[i].yold = particles[i].y;
pv[i].zold = particles[i].z;
pv[i].fx = 0;
pv[i].fy = 0;
pv[i].fz = 0;
}
}
/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
double max_f, rmin;
int i, j;
max_f = 0.0;
for (i=0; i<npart; i++) {
double xi, yi, mi, rx, ry, mj, r, fx, fy;
rmin = 100.0;
xi = myparticles[i].x;
yi = myparticles[i].y;
fx = 0.0;
fy = 0.0;
88
N-Body (source code)
for (j=0; j<npart; j++) {
rx = xi - others[j].x;
ry = yi - others[j].y;
mj = others[j].mass;
r = rx * rx + ry * ry;
/* ignore overlap and same particle */
if (r == 0.0) continue;
if (r < rmin) rmin = r;
/* compute forces */
r = r * sqrt(r);
fx -= mj * rx / r;
fy -= mj * ry / r;
}
pv[i].fx += fx;
pv[i].fy += fy;
/* Compute a rough estimate of (1/m)|df / dx| */
fx = sqrt(fx*fx + fy*fy)/rmin;
if (fx > max_f) max_f = fx;
}
return max_f;
}
/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
int i;
double a0, a1, a2;
static double dt_old = 0.001, dt = 0.001;
double dt_est, new_dt, dt_new;
89
N-Body (source code)
/* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
a0 = 2.0 / (dt * (dt + dt_old));
a2 = 2.0 / (dt_old * (dt + dt_old));
a1 = -(a0 + a2);
/* also -2/(dt*dt_old) */
for (i=0; i<npart; i++) {
double xi, yi;
/* Very, very simple leapfrog time integration. We use a variable step version to simplify time-step control. */
xi = particles[i].x;
yi = particles[i].y;
particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
pv[i].xold = xi;
pv[i].yold = yi;
pv[i].fx = 0;
pv[i].fy = 0;
}
/* Recompute a time step. Stability criteria is roughly 2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
dt_est = 1.0/sqrt(max_f);
if (dt_est < 1.0e-6) dt_est = 1.0e-6;
MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
/* Modify time step */
if (dt_new < dt) {
dt_old = dt;
dt = dt_new;
}
else if (dt_new > 4.0 * dt) {
dt_old = dt;
dt *= 2.0;
}
return dt_old;
}
90
Demo : N-Body Problem
> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds
91
Topics
• Introduction
• Array Decomposition
• Mandelbrot Sets
• Monte Carlo : PI Calculation
• Matrix Multiplication
• N-Body Problem
• Summary – Materials for Test
92
Summary – Material for the Test
• Introduction – Slides: 4, 5, 6
• Array decomposition – Slides: 11, 12
• Mandelbrot load balancing – Slides: 25, 26
• Monte Carlo create Communicators – Slides: 40, 42
• Matrix Multiply basic algorithm – Slides: 49 – 54
93
94