Transcript Lecture 5: MPI - Non-blocking Communications
Non-Blocking Communications
Example

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int my_rank, ncpus, left_neighbor, right_neighbor;
  int data_received, tag = 1001;
  MPI_Request reqSend, reqRecv;
  MPI_Status statSend, statRecv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  left_neighbor = (my_rank-1 + ncpus)%ncpus;
  right_neighbor = (my_rank+1)%ncpus;

  MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);        // comm start
  MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);
  // maybe do something useful here
  MPI_Wait(&reqSend, &statSend);   // complete comm
  MPI_Wait(&reqRecv, &statRecv);

  printf("Among %d processes, process %d received from right neighbor: %d\n",
         ncpus, my_rank, data_received);

  // clean up
  MPI_Finalize();
  return 0;
}

Sample run:
mpirun -np 4 test_shift
Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2
Semantics etc
Purpose:
Mechanism for overlapping communication and useful computation: communication and computation may proceed concurrently (latency hiding).
Deadlock avoidance (see the sketch below).
May avoid system buffering and memory-to-memory copying, and improve performance.
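As an illustration of the deadlock-avoidance point, here is a sketch of a two-process exchange (partner is assumed to be the rank of the other process). If both processes used a blocking synchronous send first, each would wait for the other's receive and the program would hang; posting the send as non-blocking lets both processes reach their receives.

/* both ranks execute this code; partner is the other rank (assumed to be set) */
double sendbuf[100], recvbuf[100];
MPI_Request req;
MPI_Status stat;

/* a blocking MPI_Ssend here on both ranks would deadlock:   */
/* each rank would wait for a receive that is never posted   */
MPI_Isend(sendbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
MPI_Recv(recvbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &stat);
MPI_Wait(&req, MPI_STATUS_IGNORE);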
Structure of non-blocking calls
Post communication request(s): non-blocking call (MPI_Isend, MPI_Irecv, …)
… // do some useful work
Complete communication call: MPI_Wait, MPI_Test, …
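A minimal sketch of this structure (src, tag and work_on_interior() are illustrative placeholders): the communication is posted, independent work is done, and only then is the request completed.

MPI_Request req;
MPI_Status stat;
double halo[64];   /* buffer being received into */

MPI_Irecv(halo, 64, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);  /* post the request */
work_on_interior();       /* hypothetical: computation that does not touch halo */
MPI_Wait(&req, &stat);    /* complete the communication; halo is now safe to read */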
Semantics etc
Non-blocking calls (MPI_Isend, MPI_Irecv, etc.) return immediately. They merely post a request to the system to initiate the communication.
However, the communication is not completed yet.
Do not tamper with the memory provided in these calls until the communication is completed by calling MPI_Wait, MPI_Test, etc. This applies to both non-blocking sends and non-blocking receives.
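To make the buffer rule concrete, a small sketch (dest and tag are assumed to be set): the send buffer must not be modified between MPI_Isend and the matching MPI_Wait.

double A[10] = {0.0};
MPI_Request req;
MPI_Status stat;

MPI_Isend(A, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
/* A[0] = 1.0;            WRONG: the buffer is still owned by MPI here */
MPI_Wait(&req, &stat);    /* communication completed */
A[0] = 1.0;               /* safe: the buffer may be reused now */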
Non-blocking Send/Recv
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_Request req1, req2;
double A[10], B[5];
…
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);
Other Non-blocking Sends
4 communication modes, same semantics as blocking sends.
MPI_ISEND – standard mode
MPI_IBSEND – buffered mode
MPI_ISSEND – synchronous mode
MPI_IRSEND – ready mode
Identical arguments as MPI_Isend:
int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
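As a reminder of how buffered mode interacts with these calls, a hedged sketch (buffer and message sizes are arbitrary; dest and tag are assumed to be set and <stdlib.h> included): MPI_Ibsend requires that a user buffer has been attached with MPI_Buffer_attach.

double msg[100];
MPI_Request req;
MPI_Status stat;
int bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
void *bsendbuf = malloc(bufsize);

MPI_Buffer_attach(bsendbuf, bufsize);                   /* provide buffering space */
MPI_Ibsend(msg, 100, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
MPI_Wait(&req, &stat);                                  /* local completion: data has been buffered */
MPI_Buffer_detach(&bsendbuf, &bufsize);                 /* blocks until buffered data is sent */
free(bsendbuf);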
Completion
Use MPI_Wait or MPI_Test to complete non-blocking communication
Semantics: after MPI_Wait returns
For a standard send: the message data has been safely stored away; it is safe to access the send buffer.
For a receive: the data has been received.
MPI_Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

*request is a handle returned from MPI_Isend, MPI_Irecv, etc.
Will block until the communication completes (or fails).
If request is from MPI_Isend, MPI_Irecv, etc., will deallocate the request object and set request to MPI_REQUEST_NULL.
Will return the status information in status: for MPI_Irecv it holds additional information; for MPI_Isend there is not much to be used.

MPI_Request req;
MPI_Status stat;
…
MPI_Irecv(…, &req);
MPI_Wait(&req, &stat);
MPI_Test
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

request – MPI_Request object from MPI_Isend, etc.
flag – true if the communication has completed; false if not yet. If true, the request object will be de-allocated and set to MPI_REQUEST_NULL.
status – contains the status information if complete.
Does not block; returns immediately.
Provides a mechanism for overlapping communication and computation: do useful computation; periodically check the communication status; if not complete, go back to computation (see the sketch below).
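A minimal sketch of that pattern (src, tag and the routine do_some_work() are illustrative assumptions):

MPI_Request req;
MPI_Status stat;
double buf[100];
int done = 0;

MPI_Irecv(buf, 100, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
while (!done) {
    do_some_work();                 /* hypothetical: one chunk of useful computation */
    MPI_Test(&req, &done, &stat);   /* check without blocking */
}
/* the receive has completed; buf may now be used */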
Properties
Order: non-overtaking; order is preserved according to the execution order of the non-blocking calls that initiate the communications.
Progress: MPI guarantees progress. A receive call completed by MPI_Wait will eventually return if there is a matching send; a send call completed by MPI_Wait will eventually return if there is a matching receive.
MPI_Comm_rank(comm, &rank);
if(rank==0) {
  MPI_Isend(A,1,MPI_DOUBLE,1,99,comm,&req1);
  MPI_Isend(B,1,MPI_DOUBLE,1,99,comm,&req2);
}
else if(rank==1) {
  MPI_Irecv(A,1,MPI_DOUBLE,0,MPI_ANY_TAG,comm,&req1);
  MPI_Irecv(B,1,MPI_DOUBLE,0,99,comm,&req2);
}
MPI_Wait(&req1,&stat1);
MPI_Wait(&req2,&stat2);
MPI_Wait Variants
Deal with arrays of MPI_Request: MPI_Request req[4];

MPI_Waitall: MPI_Waitall(int count, MPI_Request *request, MPI_Status *status)
Blocks until all active requests in the array complete; returns the statuses of all communications. Deallocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Waitany: MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)
Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing request; deallocates that request object. If there are no active requests, returns index=MPI_UNDEFINED.

MPI_Waitsome: MPI_Waitsome(int incount, MPI_Request *req, int *outcount, int *array_indices, MPI_Status *array_status)
Blocks until at least one of the active communications completes; returns the associated indices and statuses of the completed communications; deallocates those request objects. If there are no active requests, outcount=MPI_UNDEFINED.

MPI_Request req[2];
MPI_Status stat[2];
…
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitall(2, req, stat);

MPI_Request req[2];
MPI_Status stat;
int index;
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitany(2, req, &index, &stat);
…
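The snippets above cover MPI_Waitall and MPI_Waitany; here is a comparable sketch for MPI_Waitsome (buffers, dest and tag are illustrative assumptions):

MPI_Request req[2];
MPI_Status stat[2];
int indices[2];
int outcount, ncompleted = 0;
double sendbuf0[10], sendbuf1[10];   /* illustrative buffers */

MPI_Isend(sendbuf0, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req[0]);
MPI_Isend(sendbuf1, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req[1]);

while (ncompleted < 2) {
    /* blocks until at least one active request completes; may return several at once */
    MPI_Waitsome(2, req, &outcount, indices, stat);
    ncompleted += outcount;
}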
MPI_Test Variants
MPI_Testall: MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
Returns flag=true if all active requests have completed; returns flag=false otherwise. If true, de-allocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Testany: MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
If one of the active communications has completed, returns flag=true with the index and status of the completing communication, and deallocates that request object. Returns flag=false, index=MPI_UNDEFINED if none has completed. Returns flag=true, index=MPI_UNDEFINED if there are no active requests.

MPI_Testsome: MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
Returns in outcount the number of completed active communications, together with the associated indices and statuses. If none has completed, returns outcount=0; if there are no active communications, outcount=MPI_UNDEFINED.
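A hedged sketch of how MPI_Testany might be used to service whichever receive finishes first while other work continues (src0, src1, tag and do_other_work() are illustrative assumptions):

MPI_Request req[2];
MPI_Status stat;
int index, flag = 0;
double buf0[10], buf1[10];

MPI_Irecv(buf0, 10, MPI_DOUBLE, src0, tag, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(buf1, 10, MPI_DOUBLE, src1, tag, MPI_COMM_WORLD, &req[1]);

while (!flag) {
    MPI_Testany(2, req, &index, &flag, &stat);   /* non-blocking check */
    if (!flag) do_other_work();                  /* hypothetical useful computation */
}
/* req[index] has completed; the corresponding buffer is ready */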
Persistent Communication
Structure of non-blocking calls: MPI_Ixxxx allocates an MPI_Request; MPI_Wait or MPI_Test completes the communication and de-allocates the request object.
Often a communication with the same arguments is executed repeatedly, e.g. every time step or every iteration. Can create a persistent request that will not be de-allocated by MPI_Wait. Reduces overhead.

Create persistent request: MPI_Send_init, MPI_Recv_init
Repeat:
    Start communication: MPI_Start
    …
    Complete communication: MPI_Wait, MPI_Test
Free persistent request: MPI_Request_free
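A minimal sketch of that life cycle, assuming nsteps iterations of an exchange with pre-set neighbors and buffers (sendbuf, recvbuf, n, left, right, tag, nsteps are all illustrative assumptions):

MPI_Request req[2];
MPI_Status stat[2];
int step;

/* bind the arguments once */
MPI_Send_init(sendbuf, n, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &req[0]);
MPI_Recv_init(recvbuf, n, MPI_DOUBLE, right, tag, MPI_COMM_WORLD, &req[1]);

for (step = 0; step < nsteps; step++) {
    MPI_Start(&req[0]);          /* start the communications for this step */
    MPI_Start(&req[1]);
    /* ... computation that does not touch sendbuf/recvbuf ... */
    MPI_Waitall(2, req, stat);   /* complete; the requests stay allocated (inactive) */
}

MPI_Request_free(&req[0]);       /* finally release the persistent requests */
MPI_Request_free(&req[1]);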
Creation
int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *req)
int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *req)

MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init creates one for a receive.
Binds the arguments buf, count, datatype, dest (or source), tag, comm; these arguments will not change across the following communications.
On creation the request is inactive: not associated with any active communication. Communication is initiated by MPI_Start.

MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag=999;
MPI_Status stat_send, stat_recv;
…
MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);

MPI_Start(&req_send);
MPI_Start(&req_recv);
… // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);

MPI_Request_free(&req_send);
MPI_Request_free(&req_recv);
Start Communication, Free Request
int MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

request is a persistent request created by MPI_Send_init, etc.
Starts the communication associated with the request object.
The call returns immediately; it starts a non-blocking communication. The buffer should not be accessed after this call until the communication completes.
Complete the communication with MPI_Wait, MPI_Test, etc. MPI_Wait and MPI_Test will not de-allocate the request upon completion of the communication.
De-allocate the persistent request using MPI_Request_free at the end.
int MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR
Example: Matrix-Vector Multiplication
[Figure: AX = Y, where A is an NxN matrix and X, Y are vectors of dimension N. A is decomposed by block rows: cpu 0 holds [A11 A12 A13], cpu 1 holds [A21 A22 A23], cpu 2 holds [A31 A32 A33]; X and Y are split into X1, X2, X3 and Y1, Y2, Y3. Then
Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3
Each cpu holds one block of X at a time; the X blocks are shifted upward among the cpus at each step so that every cpu eventually multiplies against X1, X2 and X3.]
Example: Matrix-Vector
Data on cpu 0: [A11 A12 A13] (N/3 x N matrix), X1 (vector, length N/3), Y1 (vector, length N/3)
Data on cpu 1: [A21 A22 A23] (N/3 x N matrix), X2 (vector, length N/3), Y2 (vector, length N/3)
Data on cpu 2: [A31 A32 A33] (N/3 x N matrix), X3 (vector, length N/3), Y3 (vector, length N/3)

Need to communicate: X1, X2, X3. Upward shift; number of shifts = ncpus-1.
Assume: A[i][j] = i+j, X[i] = i
Example (non-blocking comm)

#include <stdio.h>
#include <string.h>   // for memset
#include <mpi.h>

int main(int argc, char **argv) {
  int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
  int Nx, Ny;   // Ny=DIM, Nx=DIM/ncpus; on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
  MPI_Request req_sr[2];
  MPI_Status stat_sr[2];
  double **A, *X, *Y, *Xt;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  if(DIM%ncpus != 0) {  // assume DIM divisible by ncpus
    if(my_rank==0) printf("ERROR: grid size cannot be divided by ncpus!\n");
    MPI_Finalize();
    return -1;
  }

  Nx = DIM/ncpus;   // again, on each cpu: A[Nx][Ny] etc
  Ny = DIM;
  left_neighbor = (my_rank-1 + ncpus)%ncpus;   // top neighbor
  right_neighbor = (my_rank+1)%ncpus;          // bottom neighbor

  A = DMath::newD(Nx, Ny);  // allocate memory, ignore DMath – my own routine
  X = DMath::newD(Nx);
  Xt = DMath::newD(Nx);     // Xt – temporary space for receiving from neighbor
  Y = DMath::newD(Nx);

  int i, j;
  for(i=0;i<Nx;i++)
    …   // initialize A and X: A[i][j] = i+j, X[i] = i

  int count;   // loop counter
  int sindex, curr_block;
  memset(Y, '\0', sizeof(double)*Nx);   // zero out result vector Y first
  for(count=0;count<ncpus;count++)
    …   // shift X upward with MPI_Isend/MPI_Irecv, multiply the current block, complete with MPI_Wait

  // clean up, free memory
  DMath::del(A);   // ignore DMath for now
  DMath::del(X);
  DMath::del(Xt);
  DMath::del(Y);

  MPI_Finalize();
  return 0;
}

Example: Persistent Communication

...
MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
for(count=0;count<ncpus;count++)
  …   // same structure as above, now starting the shift with MPI_Start and completing with MPI_Wait

Example: Send-Recv

...
for(count=0;count<ncpus;count++)
  …

HWK#2: Matrix Multiplication

[Figure: C = A*B with column-wise decomposition. A = [A1 A2 A3] and C = [C1 C2 C3] are split by block columns over cpu 0, cpu 1, cpu 2; B is split into blocks Bij. Then
C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)]

Column-wise decomposition:
A, B, C – NxN matrices
P – number of processors
A1, A2, A3 – Nx(N/P) matrices; C1, C2, C3 – likewise
Bij – (N/P)x(N/P) matrices
Input: A[i][j] = 2*i + j, B[i][j] = 2*i - j

HWK #2

Implement the above parallel matrix multiplication (column-wise data decomposition) in either C, C++ or Fortran.
Use non-blocking communication or persistent communication in MPI.
Test your parallel implementation and make sure the result is correct: the result for matrix C on p CPUs must be identical to that on 1 CPU.
Use a matrix size of 2048x2048 (double).
Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time.
Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times: T1, T2, …, T16.
Compute the parallel speedup factors Sp = T1/Tp, e.g. Sp = T1/T8 for 8 CPUs. Plot Sp vs. number of CPUs.
Turn in:
Source code + compiled binary code on either hamlet or radon.
Table of wall-clock time vs. number of CPUs.
Plot of parallel speedup factors.
Write-up of what you have learned from the implementation and timing results.
Due date: Oct. 11

Collective Communications

Overview

Key is a group of processes partaking in the communication.
All processes in the group participate in the communication, by calling the same function with matching arguments.
Types of collective operations:
Synchronization: MPI_Barrier
Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
Collective routines are blocking:
Completion of the call means the communication buffer can be accessed.
No indication of other processes' status of completion.
May or may not have the effect of synchronization among processes.

Overview

Can use the same communicators as PtP communications.
MPI guarantees that messages from collective communications will not be confused with PtP communications.
If you want only a sub-group of processes involved in a collective communication, you need to create a sub-group / sub-communicator from MPI_COMM_WORLD.

Barrier

int MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

Blocks the calling process until all group members have called it.
Decreases performance; refrain from using it explicitly.

…
MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
…

Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

Broadcasts a message from the process with rank root to all processes in the group, including itself.
comm and root must be the same in all processes.
The amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
For now, this means count and datatype must be the same for all processes; they may be different when generalized datatypes are involved.
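A small usage sketch (my_rank is assumed to have been obtained from MPI_Comm_rank, and the parameter value is illustrative): every process makes the same call with the same root, and after the call all processes hold the root's data.

int nsteps;                       /* illustrative parameter to distribute */
if (my_rank == 0) nsteps = 100;   /* only the root knows the value initially */

/* every process, including the root, makes the same call */
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* now nsteps == 100 on all processes */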