Transcript Lecture 5: MPI - Non-blocking Communications
Non-Blocking Communications
Example

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int my_rank, ncpus, left_neighbor, right_neighbor;
  int data_received, tag = 1001;
  MPI_Request reqSend, reqRecv;
  MPI_Status statSend, statRecv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  left_neighbor = (my_rank-1 + ncpus)%ncpus;
  right_neighbor = (my_rank+1)%ncpus;

  MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);        // comm start
  MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);
  // maybe do something useful here
  MPI_Wait(&reqSend, &statSend);   // complete comm
  MPI_Wait(&reqRecv, &statRecv);

  printf("Among %d processes, process %d received from right neighbor: %d\n",
         ncpus, my_rank, data_received);

  // clean up
  MPI_Finalize();
  return 0;
}

Sample run:
mpirun -np 4 test_shift
Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2
Semantics etc
Purpose:
Mechanism for overlapping communication and useful computation: communication and computation may proceed concurrently (latency hiding).
Deadlock avoidance (see the sketch below).
May avoid system buffering and memory-to-memory copying, and improve performance.
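As an illustration of the deadlock-avoidance point, here is a sketch of a two-process exchange (partner is assumed to be the rank of the other process). If both processes used a blocking synchronous send first, each would wait for the other's receive and the program would hang; posting the send as non-blocking lets both processes reach their receives.

/* both ranks execute this code; partner is the other rank (assumed to be set) */
double sendbuf[100], recvbuf[100];
MPI_Request req;
MPI_Status stat;

/* a blocking MPI_Ssend here on both ranks would deadlock:   */
/* each rank would wait for a receive that is never posted   */
MPI_Isend(sendbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
MPI_Recv(recvbuf, 100, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &stat);
MPI_Wait(&req, MPI_STATUS_IGNORE);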
Structure of non-blocking calls
Post communication request(s): non-blocking call (MPI_Isend, MPI_Irecv, …)
… // do some useful work
Complete communication call: MPI_Wait, MPI_Test, …
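A minimal sketch of this structure (src, tag and work_on_interior() are illustrative placeholders): the communication is posted, independent work is done, and only then is the request completed.

MPI_Request req;
MPI_Status stat;
double halo[64];   /* buffer being received into */

MPI_Irecv(halo, 64, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);  /* post the request */
work_on_interior();       /* hypothetical: computation that does not touch halo */
MPI_Wait(&req, &stat);    /* complete the communication; halo is now safe to read */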
Semantics etc
Non-blocking calls (MPI_Isend, MPI_Irecv, etc.) return immediately. They merely post a request to the system to initiate the communication.
However, the communication is not completed yet.
Do not tamper with the memory provided in these calls until the communication is completed by calling MPI_Wait, MPI_Test, etc. This applies to both non-blocking sends and non-blocking receives.
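To make the buffer rule concrete, a small sketch (dest and tag are assumed to be set): the send buffer must not be modified between MPI_Isend and the matching MPI_Wait.

double A[10] = {0.0};
MPI_Request req;
MPI_Status stat;

MPI_Isend(A, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
/* A[0] = 1.0;            WRONG: the buffer is still owned by MPI here */
MPI_Wait(&req, &stat);    /* communication completed */
A[0] = 1.0;               /* safe: the buffer may be reused now */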
Non-blocking Send/Recv
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_Request req1, req2;
double A[10], B[5];
…
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);
Other Non-blocking Sends
4 communication modes, same semantics as blocking sends.
MPI_ISEND – standard mode
MPI_IBSEND – buffered mode
MPI_ISSEND – synchronous mode
MPI_IRSEND – ready mode
Identical arguments as MPI_Isend:
int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
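As a reminder of how buffered mode interacts with these calls, a hedged sketch (buffer and message sizes are arbitrary; dest and tag are assumed to be set and <stdlib.h> included): MPI_Ibsend requires that a user buffer has been attached with MPI_Buffer_attach.

double msg[100];
MPI_Request req;
MPI_Status stat;
int bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
void *bsendbuf = malloc(bufsize);

MPI_Buffer_attach(bsendbuf, bufsize);                   /* provide buffering space */
MPI_Ibsend(msg, 100, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
MPI_Wait(&req, &stat);                                  /* local completion: data has been buffered */
MPI_Buffer_detach(&bsendbuf, &bufsize);                 /* blocks until buffered data is sent */
free(bsendbuf);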
Completion
Use MPI_Wait or MPI_Test to complete non-blocking communication
Semantics: after MPI_Wait returns
For a standard send: the message data has been safely stored away; it is safe to access the send buffer.
For a receive: the data has been received.
MPI_Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

*request is a handle returned from MPI_Isend, MPI_Irecv, etc.
Will block until the communication completes (or fails).
If request is from MPI_Isend, MPI_Irecv, etc., will deallocate the request object and set request to MPI_REQUEST_NULL.
Will return the status information in status: for MPI_Irecv it holds additional information; for MPI_Isend there is not much to be used.

MPI_Request req;
MPI_Status stat;
…
MPI_Irecv(…, &req);
MPI_Wait(&req, &stat);
MPI_Test
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

request – MPI_Request object from MPI_Isend, etc.
flag – true if the communication has completed; false if not yet. If true, the request object will be de-allocated and set to MPI_REQUEST_NULL.
status – contains the status information if complete.
Does not block; returns immediately.
Provides a mechanism for overlapping communication and computation: do useful computation; periodically check the communication status; if not complete, go back to computation (see the sketch below).
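A minimal sketch of that pattern (src, tag and the routine do_some_work() are illustrative assumptions):

MPI_Request req;
MPI_Status stat;
double buf[100];
int done = 0;

MPI_Irecv(buf, 100, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
while (!done) {
    do_some_work();                 /* hypothetical: one chunk of useful computation */
    MPI_Test(&req, &done, &stat);   /* check without blocking */
}
/* the receive has completed; buf may now be used */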
Properties
Order: non-overtaking; order is preserved according to the execution order of the non-blocking calls that initiate the communications.
Progress: MPI guarantees progress. A receive call completed by MPI_Wait will eventually return if there is a matching send; a send call completed by MPI_Wait will eventually return if there is a matching receive.
MPI_Comm_rank(comm, &rank);
if(rank==0) {
  MPI_Isend(A,1,MPI_DOUBLE,1,99,comm,&req1);
  MPI_Isend(B,1,MPI_DOUBLE,1,99,comm,&req2);
}
else if(rank==1) {
  MPI_Irecv(A,1,MPI_DOUBLE,0,MPI_ANY_TAG,comm,&req1);
  MPI_Irecv(B,1,MPI_DOUBLE,0,99,comm,&req2);
}
MPI_Wait(&req1,&stat1);
MPI_Wait(&req2,&stat2);
MPI_Wait Variants
Deal with arrays of MPI_Request: MPI_Request req[4];

MPI_Waitall: MPI_Waitall(int count, MPI_Request *request, MPI_Status *status)
Blocks until all active requests in the array complete; returns the statuses of all communications. Deallocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Waitany: MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)
Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing request; deallocates that request object. If there are no active requests, returns index=MPI_UNDEFINED.

MPI_Waitsome: MPI_Waitsome(int incount, MPI_Request *req, int *outcount, int *array_indices, MPI_Status *array_status)
Blocks until at least one of the active communications completes; returns the associated indices and statuses of the completed communications; deallocates those request objects. If there are no active requests, outcount=MPI_UNDEFINED.

MPI_Request req[2];
MPI_Status stat[2];
…
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitall(2, req, stat);

MPI_Request req[2];
MPI_Status stat;
int index;
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitany(2, req, &index, &stat);
…
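The snippets above cover MPI_Waitall and MPI_Waitany; here is a comparable sketch for MPI_Waitsome (buffers, dest and tag are illustrative assumptions):

MPI_Request req[2];
MPI_Status stat[2];
int indices[2];
int outcount, ncompleted = 0;
double sendbuf0[10], sendbuf1[10];   /* illustrative buffers */

MPI_Isend(sendbuf0, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req[0]);
MPI_Isend(sendbuf1, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req[1]);

while (ncompleted < 2) {
    /* blocks until at least one active request completes; may return several at once */
    MPI_Waitsome(2, req, &outcount, indices, stat);
    ncompleted += outcount;
}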
MPI_Test Variants
MPI_Testall: MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
Returns flag=true if all active requests have completed; returns flag=false otherwise. If true, de-allocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Testany: MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
If one of the active communications has completed, returns flag=true with the index and status of the completing communication, and deallocates that request object. Returns flag=false, index=MPI_UNDEFINED if none has completed. Returns flag=true, index=MPI_UNDEFINED if there are no active requests.

MPI_Testsome: MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
Returns in outcount the number of completed active communications, together with the associated indices and statuses. If none has completed, returns outcount=0; if there are no active communications, outcount=MPI_UNDEFINED.
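A hedged sketch of how MPI_Testany might be used to service whichever receive finishes first while other work continues (src0, src1, tag and do_other_work() are illustrative assumptions):

MPI_Request req[2];
MPI_Status stat;
int index, flag = 0;
double buf0[10], buf1[10];

MPI_Irecv(buf0, 10, MPI_DOUBLE, src0, tag, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(buf1, 10, MPI_DOUBLE, src1, tag, MPI_COMM_WORLD, &req[1]);

while (!flag) {
    MPI_Testany(2, req, &index, &flag, &stat);   /* non-blocking check */
    if (!flag) do_other_work();                  /* hypothetical useful computation */
}
/* req[index] has completed; the corresponding buffer is ready */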
Persistent Communication
Structure of non-blocking calls: MPI_Ixxxx allocates an MPI_Request; MPI_Wait or MPI_Test completes the communication and de-allocates the request object.
Often a communication with the same arguments is executed repeatedly, e.g. every time step or every iteration. Can create a persistent request that will not be de-allocated by MPI_Wait. Reduces overhead.

Create persistent request: MPI_Send_init, MPI_Recv_init
Repeat:
    Start communication: MPI_Start
    …
    Complete communication: MPI_Wait, MPI_Test
Free persistent request: MPI_Request_free
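A minimal sketch of that life cycle, assuming nsteps iterations of an exchange with pre-set neighbors and buffers (sendbuf, recvbuf, n, left, right, tag, nsteps are all illustrative assumptions):

MPI_Request req[2];
MPI_Status stat[2];
int step;

/* bind the arguments once */
MPI_Send_init(sendbuf, n, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &req[0]);
MPI_Recv_init(recvbuf, n, MPI_DOUBLE, right, tag, MPI_COMM_WORLD, &req[1]);

for (step = 0; step < nsteps; step++) {
    MPI_Start(&req[0]);          /* start the communications for this step */
    MPI_Start(&req[1]);
    /* ... computation that does not touch sendbuf/recvbuf ... */
    MPI_Waitall(2, req, stat);   /* complete; the requests stay allocated (inactive) */
}

MPI_Request_free(&req[0]);       /* finally release the persistent requests */
MPI_Request_free(&req[1]);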
Creation
int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *req)
int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *req)

MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init creates one for a receive.
Binds the arguments buf, count, datatype, dest (or source), tag, comm; these arguments will not change across the following communications.
On creation the request is inactive: not associated with any active communication. Communication is initiated by MPI_Start.

MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag=999;
MPI_Status stat_send, stat_recv;
…
MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);

MPI_Start(&req_send);
MPI_Start(&req_recv);
… // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);

MPI_Request_free(&req_send);
MPI_Request_free(&req_recv);
Start Communication, Free Request
int MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

request is a persistent request created by MPI_Send_init, etc.
Starts the communication associated with the request object.
The call returns immediately; it starts a non-blocking communication. The buffer should not be accessed after this call until the communication completes.
Complete the communication with MPI_Wait, MPI_Test, etc. MPI_Wait and MPI_Test will not de-allocate the request upon completion of the communication.
De-allocate the persistent request using MPI_Request_free at the end.
int MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR
Example: Matrix-Vector Multiplication
[Figure: AX = Y, where A is an NxN matrix and X, Y are vectors of dimension N. A is decomposed by block rows: cpu 0 holds [A11 A12 A13], cpu 1 holds [A21 A22 A23], cpu 2 holds [A31 A32 A33]; X and Y are split into X1, X2, X3 and Y1, Y2, Y3. Then
Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3
Each cpu holds one block of X at a time; the X blocks are shifted upward among the cpus at each step so that every cpu eventually multiplies against X1, X2 and X3.]
Example: Matrix-Vector
Data on cpu 0: [A11 A12 A13] (N/3 x N matrix), X1 (vector, length N/3), Y1 (vector, length N/3)
Data on cpu 1: [A21 A22 A23] (N/3 x N matrix), X2 (vector, length N/3), Y2 (vector, length N/3)
Data on cpu 2: [A31 A32 A33] (N/3 x N matrix), X3 (vector, length N/3), Y3 (vector, length N/3)

Need to communicate: X1, X2, X3. Upward shift; number of shifts = ncpus-1.
Assume: A[i][j] = i+j, X[i] = i
Example (non-blocking comm)

#include <stdio.h>
#include <string.h>   // for memset
#include <mpi.h>

int main(int argc, char **argv) {
  int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
  int Nx, Ny;   // Ny=DIM, Nx=DIM/ncpus; on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
  MPI_Request req_sr[2];
  MPI_Status stat_sr[2];
  double **A, *X, *Y, *Xt;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  if(DIM%ncpus != 0) {  // assume DIM divisible by ncpus
    if(my_rank==0) printf("ERROR: grid size cannot be divided by ncpus!\n");
    MPI_Finalize();
    return -1;
  }

  Nx = DIM/ncpus;   // again, on each cpu: A[Nx][Ny] etc
  Ny = DIM;
  left_neighbor = (my_rank-1 + ncpus)%ncpus;   // top neighbor
  right_neighbor = (my_rank+1)%ncpus;          // bottom neighbor

  A = DMath::newD(Nx, Ny);  // allocate memory, ignore DMath – my own routine
  X = DMath::newD(Nx);
  Xt = DMath::newD(Nx);     // Xt – temporary space for receiving from neighbor
  Y = DMath::newD(Nx);

  int i, j;
  for(i=0;i<Nx;i++)
    …   // initialize A and X: A[i][j] = i+j, X[i] = i

  int count;   // loop counter
  int sindex, curr_block;
  memset(Y, '\0', sizeof(double)*Nx);   // zero out result vector Y first
  for(count=0;count<ncpus;count++)
    …   // shift X upward with MPI_Isend/MPI_Irecv, multiply the current block, complete with MPI_Wait

  // clean up, free memory
  DMath::del(A);   // ignore DMath for now
  DMath::del(X);
  DMath::del(Xt);
  DMath::del(Y);

  MPI_Finalize();
  return 0;
}

Example: Persistent Communication

...
MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
for(count=0;count<ncpus;count++)
  …   // same structure as above, now starting the shift with MPI_Start and completing with MPI_Wait

Example: Send-Recv

...
for(count=0;count<ncpus;count++)
  …

HWK#2: Matrix Multiplication

[Figure: C = A*B with column-wise decomposition. A = [A1 A2 A3] and C = [C1 C2 C3] are split by block columns over cpu 0, cpu 1, cpu 2; B is split into blocks Bij. Then
C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)]

Column-wise decomposition:
A, B, C – NxN matrices
P – number of processors
A1, A2, A3 – Nx(N/P) matrices; C1, C2, C3 – likewise
Bij – (N/P)x(N/P) matrices
Input: A[i][j] = 2*i + j, B[i][j] = 2*i - j

HWK #2

Implement the above parallel matrix multiplication (column-wise data decomposition) in either C, C++ or Fortran.
Use non-blocking communication or persistent communication in MPI.
Test your parallel implementation and make sure the result is correct: the result for matrix C on p CPUs must be identical to that on 1 CPU.
Use a matrix size of 2048x2048 (double).
Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time.
Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times: T1, T2, …, T16.
Compute the parallel speedup factors Sp = T1/Tp, e.g. Sp = T1/T8 for 8 CPUs. Plot Sp vs. number of CPUs.
Turn in:
Source code + compiled binary code on either hamlet or radon.
Table of wall-clock time vs. number of CPUs.
Plot of parallel speedup factors.
Write-up of what you have learned from the implementation and timing results.
Due date: Oct. 11

Collective Communications

Overview

Key is a group of processes partaking in the communication.
All processes in the group participate in the communication, by calling the same function with matching arguments.
Types of collective operations:
Synchronization: MPI_Barrier
Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
Collective routines are blocking:
Completion of the call means the communication buffer can be accessed.
No indication of other processes' status of completion.
May or may not have the effect of synchronization among processes.

Overview

Can use the same communicators as PtP communications.
MPI guarantees that messages from collective communications will not be confused with PtP communications.
If you want only a sub-group of processes involved in a collective communication, you need to create a sub-group / sub-communicator from MPI_COMM_WORLD.

Barrier

int MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

Blocks the calling process until all group members have called it.
Decreases performance; refrain from using it explicitly.

…
MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
…

Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

Broadcasts a message from the process with rank root to all processes in the group, including itself.
comm and root must be the same in all processes.
The amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
For now, this means count and datatype must be the same for all processes; they may be different when generalized datatypes are involved.
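A small usage sketch (my_rank is assumed to have been obtained from MPI_Comm_rank, and the parameter value is illustrative): every process makes the same call with the same root, and after the call all processes hold the root's data.

int nsteps;                       /* illustrative parameter to distribute */
if (my_rank == 0) nsteps = 100;   /* only the root knows the value initially */

/* every process, including the root, makes the same call */
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* now nsteps == 100 on all processes */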