Transcript Document

In the Name of God
High-Performance Computing and Parallel Processing
Dr. Mohammad Kazem Akbari
Morteza Sargolzaei Javan
http://crc.aut.ac.ir
Outline
• An introduction to clusters
• Cluster architecture
• Parallel programming models and patterns
• Memory access models
• Challenges of parallelization
• Parallelization in multiprocessor systems
• Parallelization in cluster systems
• Parallel programming with MPI
An Introduction to Clusters
• Cluster vs Mainframe
• Design and Configuration
• Data Sharing
• Communication
• Operating System / Cluster Management
• Task Scheduling
• Failure Management
• Programming Models
• Parallel Programming
• Debugging
• Monitoring
• Some implementations …
The Difference Between a Cluster and a Mainframe
Cluster Architecture
[Figure: layered cluster architecture. Sequential and parallel applications run on top of a parallel programming environment and the cluster middleware (single system image and availability infrastructure). Each node is a PC/workstation/server with communications software and network interface hardware, and all nodes are joined by the cluster interconnection network/switch.]
Beowulf Cluster
• Home-built
• Commodity-grade PCs
• Unix-like OS
• MPI / PVM
• HPC using IPC
  • HPC: High Performance Computing
  • IPC: Inexpensive Personal Computers
A Beowulf Cluster with 128 Processors at NASA
1994
Some Example Clusters
• System Name: FSL
• Site: Forecast Systems Laboratory
• Country: USA
• Year: 2001
• Node numbers: 556
• Total # of processors: 556
• Processor type: Alpha 21264A
• Total Memory Capacity: 278 GB
• Operating System: Linux
• Total peak performance: 800 GFLOPS
• Application area: Weather Forecasting
Some Example Clusters
• System Name: Locus Supercluster
• Site: Locus Discovery
• Country: USA
• Year: 2001
• Node numbers: 708
• Total # of processors: 1416
• Processor type: Pentium III
• Total Memory Capacity: 364 GB
• Operating System: Linux
• Total peak performance: 1416 GFLOPS
• Application area: Pharmaceutical Drug Discovery
Some Example Clusters
• System Name: RHIC
• Site: Brookhaven National Laboratory
• Country: USA
• Year: 2003
• Node numbers: 1097
• Total # of processors: 2194
• Processor type: Pentium III
• Total Memory Capacity: 1020 GB
• Operating System: Linux
• Total peak performance: 3115 GFLOPS
• Application area: Nuclear and High Energy Physics research
Some Example Clusters
• System Name: ERAM
• Site: HPCRC at Tehran Polytechnic
• Country: IRAN
• Year: 2011
• Node numbers: 288
• Total # of processors: 4600
• Processor: AMD Opteron 2.3 GHz
• Total Memory Capacity: 9 TB
• Total Storage Capacity: 160 TB
• Total peak performance: 42 teraflops
• Total Processing Capacity (+GPU): 89 teraflops
• Estimated Rank in 2011: 107
Some Example Clusters
• System Name: Titan
• Site: DOE / Oak Ridge National Laboratory
• Cores: 560,640
• Rmax: 17 PFlop/s
• Rpeak: 27 PFlop/s
• Power: 8,209 kW
• Memory: 710 TB
The World's Fastest Supercomputer (First Half of 2015)
• Name: Tianhe-2 (MilkyWay-2)
• Rank #1 Top Supercomputer @ http://top500.org
• Manufacturer: National University of Defense Technology, Changsha, China
• Cores: 3,120,000
• Rmax: 34 PFlop/s
• Rpeak: 55 PFlop/s
• Power: 17,808 kW
• Memory: 1 PB
• Processor: Intel Xeon E5-2692v2 12C 2.2GHz
• Application: simulation, analysis, and government security applications
The Performance Development Trend of Clusters / Supercomputers
High-Performance Computing in the Cloud
Amazon EC2 Cluster
Top500.org: ranked #127 in 2013 and #240 in 2014
http://top5000.org/system/177457
High-Performance Computing in the Cloud
By Cycle Computing
• Cores: 156,314
• Speed: 1.21 Petaflops
• Provider: Amazon
• Cost: $33K (18 hours)
• Year: Nov 2013
• Rank #107 on Top500.org
• CycleCloud SW Features:
  • Automated bidding
  • Acquiring
  • Testing
  • Assembling
  • Data and workload distribution
Development of High-Performance Computing Frameworks in the Cloud
Approaches for Parallel Program Development
• Implicit Parallelism
  • Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, scheduling the calculations, and placing the data.
• Explicit Parallelism
  • In this approach, the programmer is responsible for most of the parallelization effort, such as task decomposition, mapping tasks to processors, and the communication structure.
  • This approach is based on the assumption that the user is often the best judge of how parallelism can be exploited for a particular application.
Parallel Programming Models
• Shared Memory Model
  • DSM (Distributed Shared Memory)
  • Threads/OpenMP (enabled for clusters)
  • Java threads (HKU JESSICA, IBM cJVM)
• Message Passing Model
  • PVM (Parallel Virtual Machine)
  • MPI (Message Passing Interface)
• Hybrid Model
  • Mixing the shared and distributed memory models
  • Using OpenMP and MPI together (see the sketch after this list)
• Object and Service Oriented Models
  • Wide-area distributed computing technologies
    • OO: CORBA, DCOM, etc.
    • Services: Web Services-based service composition
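As a concrete illustration of the hybrid model, here is a minimal sketch (my own example, not taken from the slides): MPI provides the message passing between processes, and OpenMP spawns shared-memory threads inside each process. The file name hybrid.c and the process/thread counts are assumptions for illustration.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);                 /* message passing across processes/nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                    /* shared-memory threads inside each process */
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

It would typically be built with something like mpicc -fopenmp hybrid.c -o hybrid and run with mpirun -np <processes> ./hybrid.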
Levels of Parallelism
[Figure: levels of parallelism. Whole tasks are coordinated with PVM/MPI, functions (func1, func2, func3) run as threads, loop iterations and statements such as a(0)=.., b(0)=.. are parallelized by compilers, and individual operations (+, x, load) are executed by the CPU hardware.]
Code granularity vs. code item:
• Large grain (task level): program
• Medium grain (control level): function (thread)
• Fine grain (data level): loop (compiler)
• Very fine grain (instruction level): with hardware
An Example of Two Different Scenarios
Steps of Parallelization
• Partitioning
  • Decomposition of the computational activities and the data into small tasks; a number of paradigms exist, e.g. master-worker, pipeline, divide and conquer, and SPMD.
• Communication
  • Flow of information and coordination among the tasks that are created in the partitioning stage.
• Agglomeration
  • The tasks and communication structure created in the above stages are evaluated for performance and implementation cost. Tasks may be grouped into larger tasks to improve communication, and individual communications can be bundled.
• Mapping / Scheduling
  • Assigning tasks to processors such that job completion time is minimized and resource utilization is maximized. The cost of computation can even be minimized based on QoS requirements.
Parallelization Patterns
• Parallel computing has been around for decades
• Here are some “design patterns” …
  • Master-Slave
  • Single-Program Multiple-Data (SPMD)
  • Pipelining
  • Divide and Conquer
  • Producer/Consumer
  • Work Queues
Types of Workers
• Different threads in the same core
• Different cores in the same CPU
• Different CPUs in a multi-processor system
• Different machines in a distributed system
Master/Slaves
[Figure: one master process coordinating several slave processes.]
Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes them to the workers, and gathers the partial results to produce the final result.
• Mapping / load balancing
  • Static
  • Dynamic: used when the number of tasks is larger than the number of CPUs, the tasks are only known at runtime, or the CPUs are heterogeneous.
Divide and Conquer
[Figure: the “work” is partitioned into pieces w1, w2, w3; each piece is handled by a “worker”, and the partial results r1, r2, r3 are combined into the final “result”.]
Single-Program Multiple-Data
• The most commonly used model.
• Each process executes the same piece of code, but on different parts of the data: the data is split among the available processors.
• Known under different names: geometric/domain decomposition, data parallelism.
Producer/Consumer Flow
[Figure: chains of producer (P) and consumer (C) processes in which each producer feeds the next consumer.]
Data Pipelining
• Suitable for fine-grained parallelism.
• Also suitable for applications involving multiple stages of execution that need to operate on a large number of data sets.
Work Queues
[Figure: producers (P) place work items (W) on a shared queue, and consumers (C) take them from the queue for processing.]
Challenges of Parallelization (1)
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
Challenges of Parallelization (2)
• Parallelization problems arise from:
  • Communication between workers
  • Access to shared resources (e.g., data)
• Thus, we need a synchronization system!
  • Finding bugs is hard
  • Solving bugs is even harder
Memory Access Models
• Shared-memory model
  The communication between processes in a multiprocessor environment is achieved through shared (or global) memory.
• Message-passing model
  Communication between processes in a multicomputer environment is achieved through some kind of message-switching mechanism.
Memory Typology: Shared
[Figure: several processors all connected to one shared memory.]
Memory Typology: Distributed
[Figure: each processor has its own local memory; the processors communicate over a network.]
Memory Typology: Hybrid
[Figure: within each node, several processors share a local memory; the nodes are connected to each other by a network.]
Flynn’s Taxonomy
Instructions (SI/MI) × Data (SD/MD):
• SISD (single instruction, single data): single-threaded process
• MISD (multiple instruction, single data): pipeline architecture
• SIMD (single instruction, multiple data): vector processing and GPUs
• MIMD (multiple instruction, multiple data): multi-threaded programming and multi-core processors
SISD
[Figure: one processor applies a single instruction stream to a single data stream.]
MISD
[Figure: several processors apply different instruction streams to the same data stream.]
MIMD
[Figure: several processors apply independent instruction streams to independent data streams.]
SIMD
[Figure: a single instruction stream is applied in lockstep to many data elements D0 … Dn.]
CPU vs GPU
Parallel Programming on Multiprocessors
• A parallel language should support certain operations in order to be used to write a parallel program in a multiprocessor environment:
  • Process Creation
  • Synchronization
    • Lock and Unlock
    • Wait and Signal (or Increment and Decrement)
    • Fetch & Add
    • Barrier
    • Deadlock
Synchronization
• In a multiprocessor environment that uses shared data for communication between processes, simultaneous access to the shared data by two or more processes may cause invalid results.
• Access to such shared data must be mutually exclusive.
Lock and Unlock (cont)
• Critical section
  A group of statements that must be executed or accessed by at most a certain number of processes at any given time.
• Monitor
  A structure that provides exclusive access to a critical section is called a monitor.
Monitor Structure
• A monitor represents a kind of fence around the shared data.
• The general structure of a monitor can be represented as:
  Lock(L)
  <Critical Section>
  Unlock(L)
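A minimal sketch of the Lock/Unlock idea using POSIX threads (my own illustration; the slides use the generic Lock(L)/Unlock(L) notation, and the shared counter is just a stand-in for shared data):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                               /* shared data */
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;  /* the monitor's lock */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&L);     /* Lock(L)            */
        counter++;                  /* <Critical Section> */
        pthread_mutex_unlock(&L);   /* Unlock(L)          */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 200000, because access is mutually exclusive */
    return 0;
}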
An Example for Fetch&Add
• Consider the problem of adding two vectors in parallel:
  for (i = 1; i <= k; i++) {
      Z[i] = A[i] + B[i];
  }
• Now assume there are several processes, each computing the addition for a specific i.
• At any time, each process requests a subscript, say i in the range 1 to k, to evaluate Z[i] = A[i] + B[i].
An Example for Fetch&Add (cont)
• Now the code for each process is as follows:
  int i;
  i = Fetch&Add(next_index, 1);
  while (i <= k) {
      Z[i] = A[i] + B[i];
      i = Fetch&Add(next_index, 1);
  }
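Fetch&Add above is a generic primitive. On a real shared-memory machine it could be expressed, for example, with C11 atomics; the following sketch is my own rendering under that assumption (Z, A, B, and k are as above, and next_index is shared and initialized to 1):

#include <stdatomic.h>

atomic_int next_index = 1;   /* shared: the first unclaimed subscript */

/* body executed by each process/thread */
void process_body(double *Z, const double *A, const double *B, int k)
{
    /* atomic_fetch_add returns the old value and then adds 1, like Fetch&Add */
    int i = atomic_fetch_add(&next_index, 1);
    while (i <= k) {
        Z[i] = A[i] + B[i];
        i = atomic_fetch_add(&next_index, 1);
    }
}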
Barrier
• A barrier is a point of synchronization where a predetermined number of processes have to arrive before execution of the program can continue.
• It is used to ensure that a certain number of processes complete a stage of computation before they proceed to the next stage, which requires the results of the previous stage.
An Example for Barrier
• Consider the following computation on two vectors A and B:
  sum = 0;
  for (i = 1; i <= 10; i++) {
      sum = sum + A[i];
  }
  for (i = 1; i <= 10; i++) {
      B[i] = B[i] / sum;
  }
• Assume there are two processes performing the computation in two stages.
An Example for Barrier (cont)
• Stage 1:
  • process0 adds the values of A[1], A[3], A[5], A[7], and A[9]
  • process1 adds the values of A[2], A[4], A[6], A[8], and A[10]
• Stage 2:
  • process0 computes new values for B[1], B[3], B[5], B[7], and B[9]
  • process1 computes new values for B[2], B[4], B[6], B[8], and B[10]
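A minimal OpenMP sketch of this two-stage computation (my own illustration; the slides phrase it in terms of two generic processes, and the initial values of A and B are assumptions). The barrier guarantees that sum is complete before either thread starts stage 2:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double A[11], B[11], sum = 0.0;
    for (int i = 1; i <= 10; i++) { A[i] = i; B[i] = i; }

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();      /* 0 or 1 */
        double partial = 0.0;

        /* Stage 1: thread 0 takes the odd indices, thread 1 the even ones */
        for (int i = 1 + id; i <= 10; i += 2)
            partial += A[i];
        #pragma omp atomic
        sum += partial;

        #pragma omp barrier                 /* wait here until sum is complete */

        /* Stage 2: rescale the same share of B */
        for (int i = 1 + id; i <= 10; i += 2)
            B[i] = B[i] / sum;
    }
    printf("sum = %g, B[1] = %g\n", sum, B[1]);
    return 0;
}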
Deadlock
• Deadlock is the situation where two or more processes request and hold mutually needed resources in a circular pattern.
Parallel Programming on Multicomputers
• A multicomputer consists of several processors, called nodes, connected by an interconnection network.
• Each processor has its own local memory.
• Nodes coordinate with each other by sending messages to one another.
Message-passing in Multicomputers
• A message may be either a control or a data message.
• Each message carries some additional information:
  • Destination process id
  • Message length
  • Source process id
Creation of Processes
• Processes may be created such that:
  • Data partitioning
    All processes perform the same function on different portions of the data (homogeneous multitasking).
  • Control (function) partitioning
    Each process performs a different function on the data (heterogeneous multitasking).
An Example of Partitioning
• Consider the following computation:
  Z[i] = (A[i] * B[i]) + (C[i] / D[i]), for i = 1 to 10.
• For data partitioning:
  • 10 identical processes are created.
  • Each process performs the computation for a unique index i.
• For function partitioning (see the sketch below):
  • Two different processes P1 and P2 are created.
  • P1 performs x = A[i] * B[i] and sends x to P2.
  • P2 in turn computes y = C[i] / D[i], and after it receives x from P1 it performs Z[i] = x + y.
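A minimal MPI sketch of the function-partitioning case for a single index i (my own illustration; MPI itself is covered at the end of these slides, and the sample values of A, B, C, D are assumptions). Process 0 plays the role of P1 and process 1 plays P2; run it with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double A = 3.0, B = 4.0, C = 10.0, D = 2.0;   /* one element of each vector */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* P1: x = A*B, send x to P2 */
        double x = A * B;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* P2: y = C/D, receive x, Z = x + y */
        double x, y = C / D, Z;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        Z = x + y;
        printf("Z = %g\n", Z);
    }

    MPI_Finalize();
    return 0;
}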
Data Partitioning Advantages
• Data partitioning offers the following advantages over function partitioning:
  • Higher parallelism
  • Equally balanced load among processes
  • Easier to implement
Assignment or Scheduling
• The mapping of processes to processors is referred to as assignment or scheduling.
  • Static: processes and the order of their execution are known prior to execution.
  • Dynamic: processes are assigned to processors at run time.
Data Parallel Algorithms
• Often a data parallel algorithm is built from certain standard features called building blocks.
• Some of the well-known building blocks are:
  • Elementwise operations
  • Broadcasting
  • Reduction
  • Parallel prefix
Elementwise Operations
• They are the type of operations that can be performed by the processors independently.
• Examples of such operations are arithmetic, logical, and conditional operations.
• For example, consider the addition operation on two vectors A and B, that is, C = A + B.
• The ith processor (here for i = 0 to 7) adds the ith element of A to the ith element of B and stores the result in the ith element of C.
Elementwise Operations (cont)
• Some conditional operations can also be carried out elementwise.
• For example, consider the following if statement on vectors A, B, and C:
  If (A > B), then C = A + B.
• First, the contents of vectors A and B are compared element by element.
• The result of the comparison sets a flag at each processor.
Elementwise Operations (cont)
• These flags, often called a condition mask, can be used for further operations (see the sketch below).
• If the test is successful, the flag is set to 1; otherwise it is set to 0.
• To compute C = A + B, each processor performs the addition only when its flag is set to 1.
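Written out sequentially, the masked elementwise idea amounts to something like the following sketch (my own illustration; the function name and argument layout are assumptions):

void masked_add(int n, const double *A, const double *B, double *C, int *flag)
{
    for (int i = 0; i < n; i++)
        flag[i] = (A[i] > B[i]);     /* the comparison sets the condition mask */
    for (int i = 0; i < n; i++)
        if (flag[i])
            C[i] = A[i] + B[i];      /* add only where the mask is 1 */
}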
Broadcasting
• A broadcast operation makes multiple copies of a single value (or of several data items) and distributes them to all (or some) of the processors.
• For example, a shared bus can be used to copy the value 5 to eight processors.
Broadcasting (cont)
• Sometimes it is necessary to copy several data items to several processors.
• The values of a vector, which are stored in the processors of row 0, are duplicated in the other processors.
Broadcasting (cont)
• At each step, the values of the ith row of processors (for i = 0 to 6) are copied to the (i+1)th row of processors.
Broadcasting (cont)
• For a mechanism that copies the contents of a row of processors to another row 2^i rows away (i ≥ 0), a faster method can be used.
Reduction
• A reduction operation converts several values to a single value.
• The operation can be addition, product, maximum, minimum, or logical AND, OR, and exclusive-OR.
• For example, consider the addition operation when each element to be added is stored in a processor.
• One way (a hardware approach) to perform such a summation is to have a hardwired addition circuit.
• Another way (an algorithmic approach) is to perform the summation through several steps.
Reduction (cont)
• For example, for eight processors:
  • In the first step, processor i (for odd i) adds its value to the value of processor i-1.
  • In the second step, the value of processor i (for i = 3 and 7) is added to processor i-2.
  • Finally, in the third step, the value of processor 3 is added to processor 7.
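One consistent way to code this log-step idea in plain C is the sequential simulation below (my own sketch; the exact index directions in the figure may differ, but the pairing pattern is the same):

/* pairwise tree reduction over v[0..n-1], n a power of two; the total ends up in v[n-1] */
void tree_sum(double *v, int n)
{
    for (int stride = 1; stride < n; stride *= 2)            /* log2(n) steps */
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            v[i] += v[i - stride];                           /* partner is 'stride' positions away */
}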
Parallel Prefix
• Sometimes, for a reduction operation, it is required that the final value of each processor be the result of performing the reduction on that processor and its preceding processors.
• Such a computation is called parallel prefix (also referred to as forward scan).
• For addition, the computation is called a sum-prefix operation, since it computes sums over all prefixes of the vector.
Parallel Prefix (cont)
• Now let’s see how the sum-prefix is performed on our previous example:
  • In the first step, processor i (for i > 0) adds its value to the value of processor i-1.
  • In the second step, the value of processor i (for i > 1) is added to processor i-2^1.
  • Finally, in the third step, the value of processor i (for i > 3) is added to processor i-2^2.
• At the end of the operation, each processor contains the sum of its value and those of all the preceding processors.
Parallel Prefix (cont)
• In the previous solution, not all the processors were kept busy during the operation.
• However, in this solution, all the processors are utilized.
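In MPI, which is covered at the end of these slides, the sum-prefix (forward scan) building block corresponds to the standard MPI_Scan routine. A minimal sketch (my own example; each process contributes the single value rank + 1):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank + 1;   /* each process holds one value */
    /* prefix = sum of 'value' over ranks 0..rank (inclusive scan) */
    MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}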
A Comprehensive Example of Building Blocks
• Consider the multiplication of two n-by-n matrices A and B.
• A simple parallel algorithm could use
  • broadcast,
  • elementwise multiplication, and
  • reduction (sum) operations
  to perform such a task.
A Comprehensive Example of Building Blocks (cont)
• Assume that there are n^3 processors arranged in a cube form:
  • Matrix A is loaded on the front n^2 processors.
  • Matrix B is loaded on the top n^2 processors.
A Comprehensive Example of Building Blocks (cont)
• In the first step of the algorithm, the values of matrix A are broadcast onto the processors.
• In the second step, the values of matrix B are broadcast onto the processors.
• Each of these steps takes O(log2 n) time.
A Comprehensive Example of Building Blocks (cont)
• In the third step, an elementwise multiply operation is performed by each processor.
• This operation takes O(1) time.
• Finally, in the fourth step, a sum-prefix operation is performed.
• This operation takes O(log2 n) time.
• Therefore, the total time is O(log2 n).
A More Detailed Example
• Now consider A and B to be 2-by-2 matrices:

  A × B = C:
  [ a11  a12 ] [ b11  b12 ]   [ c11  c12 ]
  [ a21  a22 ] [ b21  b22 ] = [ c21  c22 ]

• The value (or values) of each processor for multiplying these two matrices are shown in the following slides.
(a) Loading A and B
(b) Broadcasting A
(c) Broadcasting B
(d) Elementwise Multiplication
(e) Parallel Sum-Prefix
Parallel Programming with MPI
Message-Passing Programming Paradigm
• Each processor in a message-passing program runs a sub-program:
  • written in a conventional sequential language
  • all variables are private
  • processes communicate via special subroutine calls
[Figure: each processor (P) has its own memory (M); the processors are connected by an interconnection network.]
Single Program Multiple Data
• Introduced in data parallel programming (HPF)
• Same program runs everywhere
• MPI 1.0 contains over 115 routines/functions that can be grouped into 8
categories.
Examples
main(int argc, char **argv)
{
    if (process is to become Master)
    {
        MasterRoutine(/*arguments*/)
    }
    else /* it is worker process */
    {
        WorkerRoutine(/*arguments*/)
    }
}
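A runnable (though simplified) version of this skeleton, written as a sketch of the master/worker pattern rather than as a definitive implementation: rank 0 hands out task indices one at a time and sums the squared results returned by the workers. The task count, the tags, and the "work" itself (squaring the task index) are assumptions for illustration.

#include <mpi.h>
#include <stdio.h>

#define NTASKS   20
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* MasterRoutine */
        int next = 0, result, total = 0, active = 0;
        MPI_Status st;
        /* seed each worker with one task (or a stop message if no work is left) */
        for (int w = 1; w < size; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* gather results and hand out the remaining tasks until all workers stop */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("sum of squares of 0..%d = %d\n", NTASKS - 1, total);
    } else {                                           /* WorkerRoutine */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = task * task;                      /* stand-in for real work */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}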
Messages
• Messages are packets of data moving between sub-programs.
• The message passing system has to be told the following information:
  • Sending processor
  • Source location
  • Data type
  • Data length
  • Receiving processor(s)
  • Destination location
  • Destination size
Messages
• Addressing:
  • Messages need to have addresses to be sent to.
• Reception:
  • It is important that the receiving process is capable of dealing with the messages it is sent.
• A message passing system is similar to:
  • Post office, phone line, fax, e-mail, etc.
• Message types:
  • Point-to-point, collective, synchronous (telephone) / asynchronous (postal)
Point-to-Point Communication
• The simplest form of message passing.
• One process sends a message to another.
Point-to-Point Variations
• Synchronous sends
  • provide information about the completion of the message
• Asynchronous sends
  • only know when the message has left
• Blocking operations
  • only return from the call when the operation has completed
• Non-blocking operations
  • return straight away; you can test/wait later for completion (see the sketch below)
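A small sketch of the non-blocking variant (my own illustration, assuming exactly two processes): each process posts a receive, starts its send without waiting, and only later waits for both operations to complete, so useful computation could overlap with the communication.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, other, sendval, recvval;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other = 1 - rank;                 /* run with exactly 2 processes */
    sendval = rank * 100;

    /* non-blocking calls return straight away; the transfer completes later */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req[1]);

    /* ... useful computation could be overlapped with communication here ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}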
Group / Collective Communications
• Can be built out of point-to-point communications.
• Barriers
  • synchronise processes
• Broadcast
  • one-to-many communication
• Reduction operations
  • combine data from several processes to produce a single (usually) result
General MPI Program Structure
MPI Include File
Initialize MPI Environment
Do work and perform message communication
Terminate MPI Environment
Example - C
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* initialize MPI */
    MPI_Init(&argc, &argv);

    /* main part of program */

    /* terminate MPI */
    MPI_Finalize();
    exit(0);
}
MPI helloworld.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from process %d of %d\n", rank, numtasks);
    MPI_Finalize();
    return 0;
}
MPI_COMM_WORLD
• MPI_INIT defines a communicator called MPI_COMM_WORLD for every process that calls it.
• All MPI communication calls require a communicator argument.
• MPI processes can only communicate if they share a communicator.
• A communicator contains a group, which is a list of processes.
• Each process has its rank within the communicator.
• A process can have several communicators.
Communicators
• MPI uses objects called communicators, which define which collection of processes may communicate with each other.
• Every process has a unique integer identifier assigned by the system when the process initializes. A rank is sometimes called a process ID.
• Processes can request information from a communicator:
  • MPI_Comm_rank(MPI_Comm comm, int *rank)
    • Returns the rank of the process in comm
  • MPI_Comm_size(MPI_Comm comm, int *size)
    • Returns the size of the group in comm
Communicators and Groups
• An intracommunicator is used for communication within a single group.
• An intercommunicator is used for communication between 2 disjoint groups.
Finishing up
• An MPI program should call MPI_Finalize when all communications have completed.
• Once called, no other MPI calls can be made.
• Aborting: MPI_Abort(comm)
  • Attempts to abort all processes listed in comm; if comm = MPI_COMM_WORLD, the whole program terminates.
Compile and Run Commands
• Compile:
  > mpicc helloworld.c -o helloworld
• Run (-np gives the number of processes):
  > mpirun -np 3 helloworld [hosts picked from configuration]
  > mpirun -np 3 -machinefile machines.list helloworld
• The file machines.list contains the node list:
  hpcrc.aut.ac.ir
  Crc.aut.ac.ir
  node1
  node2
  ..
  node6
  node13
• Some nodes may not work today if they have failed!
Sample Run and Output
• A run with 3 processes:
  > mpirun -np 3 -machinefile machines.list helloworld
  Hello World from process 0 of 3
  Hello World from process 1 of 3
  Hello World from process 2 of 3
• A run by default:
  > helloworld
  Hello World from process 0 of 1
Sample Run and Output
• A run with 6 processes:
  > mpirun -np 6 -machinefile machines.list helloworld
  Hello World from process 0 of 6
  Hello World from process 3 of 6
  Hello World from process 1 of 6
  Hello World from process 5 of 6
  Hello World from process 4 of 6
  Hello World from process 2 of 6
• Note: Process execution need not be in process number order.
Hello World with Error Check
Display Hostname of MPI Process
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank, resultlen;
    static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(mpi_hostname, &resultlen);
    printf("Hello World from process %d of %d running on %s\n", rank, numtasks, mpi_hostname);
    MPI_Finalize();
    return 0;
}
MPI Routines
• Environment Management
• Point-to-Point Communication
• Collective Communication
• Process Group Management
• Communicators
• Derived Type
• Virtual Topologies
• Miscellaneous Routines
Environment Management Routines
Point-to-Point Communication
Collective Communication Routines
Process Group Management Routines
Communicators Routines
Virtual Topologies Routines
Miscellaneous Routines
Standard Send (cont.)
MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
• buf: the address of the data to be sent
• count: the number of elements of datatype that buf contains
• datatype: the MPI datatype
• dest: the rank of the destination in communicator comm
• tag: a marker used to distinguish different message types
• comm: the communicator shared by the sender and receiver
• ierror: the Fortran return value of the send
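For reference, the matching receive call (quoted from the standard MPI interface rather than from the slides) is:

MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
         int tag, MPI_Comm comm, MPI_Status *status)

• status: filled in with the actual source and tag of the received message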
Group Communication Routines
• int MPI_Bcast(void* message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
• int MPI_Reduce(void* operand, void* result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
  • The most common op values:
    • MPI_SUM: summation
    • MPI_MAX: maximum
    • MPI_PROD: product
    • MPI_MIN: minimum
• int MPI_Barrier(MPI_Comm comm)
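A minimal sketch combining these routines (my own example; the value 100 broadcast from the root is an assumption for illustration): rank 0 broadcasts a number to everyone, each process adds its rank to it, and MPI_Reduce collects the sum back at rank 0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n, local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    n = 0;
    if (rank == 0) n = 100;                            /* value known only to the root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);      /* now every process has n */

    local = n + rank;                                  /* each process computes something */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}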
Different Communication Methods
MPI Send/Receive a Character (cont...)
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'X';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI Send/Receive a Character
    if (rank == 0) {
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
MPI Send/Receive a Character
    else if (rank == 1) {
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
MPI.NET
Exercise
Run the Word Count example using MPI, and submit a report on the algorithm / flowchart used, along with your assumptions.
Cloud and Rain
http://crc.aut.ac.ir