Scalable Parallel Architectures and their Software
NPACI Parallel Computing Seminars
San Diego Supercomputer Center
Introduction
• Overview of RISC CPUs, Memory Hierarchy
• Parallel Systems - General Hardware Layout (SMP, Distributed,
Hybrid)
• Communications Networks for Parallel Systems
• Parallel I/O
• Operating Systems Concepts
• Overview of Parallel Programming Methodologies
– Distributed Memory
– Shared-Memory
• Hardware Specifics of NPACI Parallel Machines
– IBM SP Blue Horizon
– New CPU Architectures
• IBM Power 4
• Intel IA-64
What is Parallel Computing?
• Parallel computing: the use of multiple computers or
processors or processes working together on a common
task.
– Each processor works on its section of the problem
– Processors are allowed to exchange information (data in local
memory) with other processors
[Figure: grid of the problem to be solved in the x-y plane, divided into four areas; CPU #1 through CPU #4 each work on their own area of the problem and exchange boundary data with neighboring CPUs.]
Why Parallel Computing?
• Limits of single-CPU computing
– Available memory
– Performance - usually “time to solution”
• Limits of Vector Computers – main HPC alternative
– System cost, including maintenance
– Cost/MFlop
• Parallel computing allows:
– Solving problems that don’t fit on a single CPU
– Solving problems that can’t be solved in a reasonable time on one
CPU
• We can run…
– Larger problems
– Finer resolution
– Faster
– More cases
Scalable Parallel Computer Systems
Scalable [ (CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System
Scalable Parallel Computer Systems
Scalability: A parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality.
• Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.)
• Application scalability: scalability with respect to
– machine size
– problem size
Shared and Distributed Memory Systems
[Figure: a shared-memory multiprocessor, in which CPUs access a common memory over a bus or crossbar, and a distributed-memory multicomputer, in which CPUs, each with local memory (M), are connected by a network.]
Multicomputer (distributed memory): each processor has its own local memory.
Examples: CRAY T3E, IBM SP2, PC Cluster
Multiprocessor (shared memory): single address space; all processors have access to a pool of shared memory.
Examples: SUN HPC, CRAY T90, NEC SX-6
Methods of memory access:
- Bus
- Crossbar
Hybrid (SMP Clusters) Systems
[Figure: three SMP nodes, each with four CPUs sharing a memory through an on-node interconnect, connected to one another by a network.]
Hybrid architecture – processes share memory on-node, may/must use message-passing off-node, and may share off-node memory.
Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer
RISC-Based Computer Hardware Concepts
RISC CPUs are the most common CPUs in HPC; many design concepts were transferred from vector CPUs to RISC and then to CISC.
• Multiple Functional Units
• Pipelined Instructions
• Memory Hierarchy
• Instructions typically take 1-several CPU clock cycles
– Clock cycles provide time scale for measurement
• Data transfers – memory-to-CPU, network, I/O, etc.
Processor Related Terms
An instruction set defines how each operation is processed; e.g., x = y + 1 is carried out as: load y and 1, add them, and put the result in x.
• RISC : Reduced Instruction Set Computer
• PIPELINE : Technique where multiple instructions are overlapped in execution
• SUPERSCALAR : Computer design feature - multiple instructions can be executed per clock period
‘Typical’ RISC CPU
[Figure: a 'typical' RISC CPU chip - registers r0 through r32 feed functional units (FP Add, FP Multiply, FP Multiply & Add, FP Divide), while loads and stores move data between the registers and memory/cache.]
Functional Unit
[Figure: chair-building analogy for a pipelined functional unit - carpenters 1 through 5 each perform one stage of the job.]
• Fully segmented multiply pipeline: A(I) = C(I)*D(I), with operands C(I) and D(I) streaming through the stages of the multiply pipeline to produce A(I).
Dual Hardware Pipes
[Figure: dual hardware pipes computing A(I) = C(I)*D(I) - odd elements C(I), D(I) feed one pipe and even elements C(I+1), D(I+1) feed the other, so A(I) and A(I+1) are produced together.]
RISC Memory/Cache Related Terms
• ICACHE : Instruction cache
• DCACHE (Level 1) : Data cache closest to registers
• SCACHE (Level 2) : Secondary data cache
– Data from SCACHE has to go through DCACHE to registers
– SCACHE is larger than DCACHE
– Not all processors have SCACHE
• CACHE LINE : Minimum transfer unit (usually in bytes) for moving data between different levels of the memory hierarchy
• TLB : Translation look-aside buffer; keeps addresses of pages (blocks of memory) in main memory that have been recently accessed
• MEMORY BANDWIDTH : Transfer rate (in MBytes/sec) between different levels of memory
• MEMORY ACCESS TIME : Time required (often measured in clock cycles) to bring data items from one level of memory to another
• CACHE COHERENCY : Mechanism for ensuring data consistency of shared variables across the memory hierarchy
RISC CPU, CACHE, and MEMORY
Basic Layout
[Figure: memory hierarchy - CPU registers at the top, then Level 1 cache, Level 2 cache, and main memory; speed and cost per bit decrease, and size increases, moving down the hierarchy.]
RISC Memory/Cache Related Terms (cont.)
Direct mapped cache: A block from main memory can go in
exactly one place in the cache. This is called direct mapped
because there is direct mapping from any block address in
memory to a single location in the cache.
[Figure: direct-mapped cache - each block of main memory maps to exactly one location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Fully associative cache : A block from main memory can be
placed in any location in the cache. This is called fully
associative because a block in main memory may be
associated with any entry in the cache.
[Figure: fully associative cache - a block of main memory may be placed in any location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Set-associative cache : The middle range of designs between direct-mapped cache and fully associative cache is called set-associative cache. In an n-way set-associative cache, a block from main memory can go into n (n at least 2) locations in the cache.
[Figure: 2-way set-associative cache - each block of main memory maps to one set of two locations in the cache.]
RISC Memory/Cache Related Terms
• The data cache was designed to allow programmers to take advantage of common data access patterns:
– Spatial locality
• When an array element is referenced, its neighbors are likely to be referenced
• Cache lines are fetched together
• Work on consecutive data elements in the same cache line
– Temporal locality
• When an array element is referenced, it is likely to be referenced again soon
• Arrange code so that data in cache is reused as often as possible
Typical RISC Floating-Point Operation Times
IBM POWER3 II
• CPU Clock Speed – 375 MHz ( ~ 3 ns)
Instruction           32-Bit (cycles)   64-Bit (cycles)
FP Multiply or Add    3-4               3-4
FP Multiply-Add       3-4               3-4
FP Square Root        14-23             22-31
FP Divide             14-21             18-25
Typical RISC Memory Access Times
IBM POWER3 II
Access                       Bandwidth (GB/s)   Time (cycles)
Load register from L1        3.2                1
Store register to L1         1.6                1
Load/store L1 from/to L2     6.4                9
Load/store L1 from/to RAM    1.6                35
Single CPU Optimization
Optimization of the serial (single-CPU) version is very important
• Where appropriate, parallelize the best serial version
New CPUs in HPC
New CPU designs with new features
• IBM POWER 4
– U Texas Regatta nodes – covered on Wednesday
• Intel IA-64
– SDSC DTF TeraGrid PC Linux Cluster
Parallel Networks
The network's function is to transfer data from source to destination in support of the network transactions used to realize the supported programming model(s). Data transfer can be for message-passing and/or shared-memory operations.
• Network terminology
• Common parallel networks
System Interconnect Topologies
Information is sent among CPUs through a network. The best choice would be a fully connected network, in which each processor has a direct link to every other processor, but such a network is very expensive and difficult to scale, growing as ~N*N. Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.
[Figure: example topologies - 3-D hypercube, 2-D mesh, 2-D torus.]
Network Terminology
• Network Latency : Time taken to begin sending a message. Unit is
microsecond, millisecond etc. Smaller is better.
• Network Bandwidth : Rate at which data is transferred from one
point to another. Unit is bytes/sec, Mbytes/sec etc. Larger is better.
– May vary with data size
For IBM Blue Horizon:
Switch type   Latency (microseconds)    Bandwidth (MB/sec)
US            ~17 (~6000 CPU cycles)    ~350
Network Terminology
Bus
• Shared data path
• Data requests require exclusive access
• Complexity ~ O(N)
• Not scalable – bandwidth ~ O(1)
Crossbar Switch
• Non-blocking switching grid among network elements
• Bandwidth ~ O(N)
• Complexity ~ O(N*N)
Multistage Interconnection Network (MIN)
• Hierarchy of switching networks – e.g., an Omega network for N CPUs and N memory banks has complexity ~ O(ln(N))
Network Terminology (Continued)
• Diameter – maximum distance (in nodes) between any two processors
• Connectivity – number of distinct paths between any two processors
• Channel width – maximum number of bits that can be sent simultaneously over the link connecting two processors = number of physical wires in each link
• Channel rate – peak rate at which data can be sent over a single physical wire
• Channel bandwidth – peak rate at which data can be sent over a link = (channel rate) * (channel width)
• Bisection width – minimum number of links that have to be removed to partition the network into two equal halves
• Bisection bandwidth – maximum amount of data that can move between any two halves of the network connecting equal numbers of CPUs = (bisection width) * (channel bandwidth)
Communication Overhead
Time to send a message of M bytes – simple form:
Tcomm = TL + M*Td + Tcontention
where TL = message latency, Td = transfer time per byte = 1 byte/bandwidth, and Tcontention accounts for other network traffic.
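To make the formula concrete, here is a small Fortran sketch (not from the original slides) that evaluates the model, ignoring contention; the latency and bandwidth values are the Blue Horizon numbers from the previous slide.

! Hedged sketch: Tcomm = TL + M*Td with Td = 1/bandwidth.
program comm_model
  implicit none
  real(8), parameter :: t_latency = 17.0d-6     ! seconds (~17 microseconds)
  real(8), parameter :: bandwidth = 350.0d6     ! bytes per second (~350 MB/s)
  real(8) :: m_bytes, t_comm

  m_bytes = 1.0d6                               ! a 1 MB message
  t_comm  = t_latency + m_bytes / bandwidth
  print *, 'estimated transfer time (s):', t_comm   ! about 2.9e-3 s
end program comm_model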
Parallel I/O
I/O can be a limiting factor in parallel applications
• I/O system properties – capacity, bandwidth, access time
• Need support for parallel I/O in the programming system
• Need underlying hardware and system support for parallel I/O
– IBM GPFS – low-level API for developing high-level parallel I/O functionality – MPI I/O, HDF5, etc.
Unix OS Concepts for Parallel Programming
Most Operating Systems used by Parallel Computers are Unix-based
• Unix Process (task)
– Executable code
– Instruction pointer
– Stack
– Logical registers
– Heap
– Private address space
– Task forking to create dependent processes – thousands of clock
cycles
• Thread – “lightweight process”
– Logical registers
– Stack
– Shared address space
• Hundreds of clock cycles to create/destroy/synchronize threads
Parallel Computer Architectures (Flynn Taxonomy)
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, Hybrid (SMP cluster)
Hardware Architecture Models for Design of
Parallel Programs
Sequential computers - von Neumann model (RAM) is
universal computational model
Parallel computers - no one model exists
• Model must be sufficiently general to encapsulate
hardware features of parallel systems
• Programs designed from model must execute efficiently on
real parallel systems
Designing and Building Parallel
Applications
Donald Frederick
[email protected]
San Diego Supercomputer Center
What is Parallel Computing?
• Parallel computing: the use of multiple computers or
processors or processes concurrently working together
on a common task.
– Each processor/process works on its section of the problem
– Processors/processes are allowed to exchange information (data in
local memory) with other processors/processes
[Figure: grid of the problem to be solved in the x-y plane, divided into four areas; CPU #1 through CPU #4 each work on their own area of the problem and exchange boundary data with neighboring CPUs.]
Shared and Distributed Memory Systems
[Figure: a shared-memory multiprocessor, in which CPUs reach a common memory through an interconnect, and a distributed-memory multicomputer, in which CPUs, each with local memory (M), are connected by a network.]
Multicomputer (distributed memory): each processor has its own local memory. Processes usually do message passing to exchange data among processors. Usually multiple copies of the OS.
Multiprocessor (shared memory): single address space. Processes have access to a pool of shared memory. Single OS.
Hybrid (SMP Clusters) System
[Figure: three SMP nodes, each with four CPUs sharing a memory through an on-node interconnect, joined by a network.]
• May/must use message-passing off-node
• Single or multiple OS copies
• Node-local operations are less costly than off-node operations
Unix OS Concepts for Parallel Programming
Most Operating Systems used are Unix-based
• Unix Process (task)
– Executable code
– Instruction pointer
– Stack
– Logical registers
– Heap
– Private address space
– Task forking to create dependent processes – thousands of clock
cycles
• Thread – “lightweight process”
– Logical registers
– Stack
– Shared address space
• Hundreds of clock cycles to create/destroy/synchronize threads
Generic Parallel Programming Models
Single Program Multiple Data Stream (SPMD)
– Each CPU accesses same object code
– Same application run on different data
• Data exchange may be handled explicitly/implicitly
– “Natural” model for SIMD machines
– Most commonly used generic parallel programming model
• Message-passing
• Shared-memory
– Usually uses process/task ID to differentiate
– Focus of remainder of this section
Multiple Program Multiple Data Stream (MPMD)
– Each CPU accesses different object code
– Each CPU has only data/instructions needed
– “Natural” model for MIMD machines
Parallel “Architectures” – Mapping Hardware
Models to Programming Models
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, Hybrid (SMP cluster)
• Programming model: SPMD, MPMD
Methods of Problem Decomposition for Parallel
Programming
Want to map (Problem + Algorithms + Data) onto the Architecture
• Conceptualize the mapping via, e.g., pseudocode
• Realize the mapping via a programming language
• Data Decomposition - data parallel program
– Each processor performs the same task on different data
– Example - grid problems
• Task (Functional ) Decomposition - task parallel program
– Each processor performs a different task
– Example - signal processing – adding/subtracting frequencies from
spectrum
• Other Decomposition methods
Designing and Building Parallel
Applications
•Generic Problem Architectures
•Design and Construction Principles
•Incorporate Computer Science Algorithms
•Use Parallel Numerical Libraries Where Possible
Designing and Building Parallel
Applications
•Knowing when (not) to parallelize is very important
•Cherri Pancake's "Rules" summarized:
•Frequency of Use
•Execution Time
•Resolution Needs
•Problem Size
Categories of Parallel Problems
Generic Parallel Problem “Architectures” ( after G Fox)
• Ideally Parallel (Embarrassingly Parallel, "Job-Level Parallel")
– Same application run on different data
– Could be run on separate machines
– Example: Parameter Studies
• Almost Ideally Parallel
– Similar to the ideal case, but with "minimum" coordination required
– Example: Linear Monte Carlo calculations, integrals
• Pipeline Parallelism
– Problem divided into tasks that have to be completed sequentially
– Can be transformed into partially sequential tasks
– Example: DSP filtering
• Synchronous Parallelism
– Each operation performed on all/most of the data
– Operations depend on results of prior operations
– All processes must be synchronized at regular points
– Example: Modeling atmospheric dynamics
• Loosely Synchronous Parallelism
– Similar to the synchronous case, but with "minimum" intermittent data sharing
– Example: Modeling diffusion of contaminants through groundwater
Designing and Building Parallel Applications
Attributes of Parallel Algorithms
– Concurrency - Many actions performed “simultaneously”
– Modularity - Decomposition of complex entities into simpler
components
– Locality - Want a high ratio of local memory accesses to remote memory accesses
– Usually want to minimize the communication/computation ratio
– Performance
• Measures of algorithmic “efficiency”
– Execution time
– Complexity usually ~ Execution Time
– Scalability
Designing and Building Parallel
Applications
Partitioning - Break down main task into smaller ones – either identical or
“disjoint”.
Communication phase - Determine communication patterns for task
coordination, communication algorithms.
Agglomeration - Evaluate task and/or communication structures with respect to performance and implementation costs. Tasks may be combined to improve performance or reduce communication costs.
Mapping - Tasks assigned to processors; maximize processor utilization,
minimize communication costs. Mapping may be either static or dynamic.
May have to iterate whole process until satisfied with expected performance
– Consider writing application in parallel, using either SPMD message-passing
or shared-memory
– Implementation (software & hardware) may require revisit, additional
refinement or re-design
Designing and Building Parallel Applications
Partitioning
– Geometric or Physical decomposition (Domain Decomposition) - partition
data associated with problem
– Functional (task) decomposition – partition into disjoint tasks associated with
problem
– Divide and Conquer – partition problem into two simpler problems of
approximately equivalent “size” – iterate to produce set of indivisible subproblems
Generic Parallel Programming Software Systems
Message-Passing
– Local tasks, each encapsulating local data
– Explicit data exchange
– Supports both SPMD and MPMD
– Supports both task and data decomposition
– Most commonly used
– Process-based, but for performance, processes should be running on separate CPUs
– Example API: MPI, PVM Message-Passing libraries
– MP systems, in particular, MPI, will be focus of remainder of workshop
Data Parallel
– Usually SPMD
– Supports data decomposition
– Data mapping to cpus may be either implicit/explicit
– Example: HPF compiler
Shared-Memory
– Tasks share common address space
– No explicit transfer of data - supports both task and data decomposition
– Can be SPMD, MPMD
– Thread-based, but for performance, threads should be running on separate CPUs
– Example API : OpenMP, Pthreads
Hybrid - Combination of Message-Passing and Shared-Memory - supports both task and data decomposition
– Example: OpenMP + MPI
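To make the shared-memory entry above concrete, here is a minimal OpenMP Fortran sketch (not from the original slides): the threads of one process share the array, and the loop iterations are divided among them.

! Hedged sketch of the shared-memory (OpenMP) model: one process, many threads,
! loop iterations distributed across threads that all see the same data.
program openmp_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), total
  integer :: i

  total = 0.0d0
!$omp parallel do reduction(+:total)
  do i = 1, n
     a(i)  = dble(i)
     total = total + a(i)
  end do
!$omp end parallel do

  print *, 'sum =', total, '  threads available =', omp_get_max_threads()
end program openmp_sketch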
Programming Methodologies - Practical Aspects
Bulk of parallel programs are written in Fortran, C, or C++
• Generally the best compiler and tool support for parallel program development
Bulk of parallel programs use Message-Passing with MPI
• Performance, portability, mature compilers, and libraries for parallel program development
Data and/or tasks are split up onto different processors by:
• Distributing the data/tasks onto different CPUs, each with local memory (MPPs, MPI)
• Distributing the work of each loop to different CPUs (SMPs, OpenMP, Pthreads)
• Hybrid - distributing data onto SMPs and then, within each SMP, distributing the work of each loop (or task) to different CPUs within the box (SMP cluster, MPI & OpenMP)
Typical Data Decomposition for Parallelism
Example: Solve 2-D Wave Equation:
Original partial differential equation:

∂f/∂t = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation:

(f(i,j,n+1) − f(i,j,n)) / Δt = D (f(i+1,j,n) − 2 f(i,j,n) + f(i−1,j,n)) / Δx²
                             + B (f(i,j+1,n) − 2 f(i,j,n) + f(i,j−1,n)) / Δy²

where n labels the time level and (i,j) the grid point.

[Figure: the x-y grid is decomposed into strips, one per processor, PE #0 through PE #7.]
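A minimal Fortran sketch (not from the original slides) of the update implied by the finite-difference approximation above, written for the block of the grid owned by one processor; the grid size, coefficients, and time step are illustrative assumptions.

! Hedged sketch: one explicit time step of the finite-difference update above.
program fd_step_sketch
  implicit none
  integer, parameter :: nx = 50, ny = 50
  real(8), parameter :: D = 1.0d0, B = 1.0d0
  real(8), parameter :: dt = 1.0d-4, dx = 1.0d-2, dy = 1.0d-2
  real(8) :: fold(0:nx+1,0:ny+1), fnew(0:nx+1,0:ny+1)
  integer :: i, j

  fold = 0.0d0
  fold(nx/2, ny/2) = 1.0d0              ! an arbitrary initial disturbance

  ! Update interior points; rows/columns 0 and nx+1/ny+1 hold boundary values.
  do j = 1, ny
     do i = 1, nx
        fnew(i,j) = fold(i,j) + dt * (                                     &
             D * (fold(i+1,j) - 2.0d0*fold(i,j) + fold(i-1,j)) / dx**2 +   &
             B * (fold(i,j+1) - 2.0d0*fold(i,j) + fold(i,j-1)) / dy**2 )
     end do
  end do

  print *, 'center value after one step:', fnew(nx/2, ny/2)
end program fd_step_sketch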
Sending Data Between CPUs
f i n1, j  2 f i ,nj  f i n1, j
f i ,nj 1  2 f i ,nj  f i ,nj 1
f i n1  f i n
 D
 B
2
t
x
y 2
Sample Pseudo Code
if (taskid == 0) then
   li = 1;  ui = 25
   lj = 1;  uj = 25
   send(1:25) = f(25,1:25)     ! pass boundary values to the neighboring task
else if (taskid == 1) then
   ....
else if (taskid == 2) then
   ...
else if (taskid == 3) then
   ...
end if

do j = lj, uj
   do i = li, ui
      work on f(i,j)
   end do
end do
[Figure: a 50 x 50 grid split into four 25 x 25 blocks, one per processor; each processor exchanges the boundary rows and columns of its block (e.g., row i=25 with row i=26, column j=25 with column j=26) with its neighboring processors.]
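A hedged MPI Fortran sketch of the kind of boundary exchange the pseudocode above suggests (the rank layout, array names, and sizes are illustrative assumptions, not the original code): each rank owns a strip of columns and swaps one boundary column with each neighbor before the stencil update.

! Hedged sketch: exchange ghost columns with left/right neighbors using MPI.
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: n = 25
  integer :: ierr, rank, nprocs, left, right
  integer :: status(MPI_STATUS_SIZE)
  real(8) :: f(n, 0:n+1)                 ! columns 0 and n+1 are ghost columns

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  f = dble(rank)
  left  = rank - 1
  right = rank + 1
  if (left  < 0)       left  = MPI_PROC_NULL   ! no neighbor: exchange is a no-op
  if (right >= nprocs) right = MPI_PROC_NULL

  ! Send my last owned column to the right neighbor and receive the left
  ! neighbor's last column into my left ghost column, then the reverse.
  call MPI_Sendrecv(f(1,n), n, MPI_DOUBLE_PRECISION, right, 0, &
                    f(1,0), n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  call MPI_Sendrecv(f(1,1),   n, MPI_DOUBLE_PRECISION, left,  1, &
                    f(1,n+1), n, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)

  ! ... stencil update on columns 1..n would go here ...

  call MPI_Finalize(ierr)
end program halo_sketch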
Typical Task Parallel Decomposition
[Figure: pipeline of processes - the spectrum comes in, Process 0 subtracts frequency f1, Process 1 subtracts frequency f2, Process 2 subtracts frequency f3, and the spectrum goes out.]
• Signal processing
• Use one processor for each independent task
• Can use more processors if one is overloaded
Basics of Task Parallel Decomposition - SPMD
Same program will run on 2 different CPUs
Task decomposition analysis has defined 2 tasks (a and b) to be done
by 2 CPUs
program.f:
   ...
   initialize
   ...
   if TaskID = A then
      do task a
   elseif TaskID = B then
      do task b
   end if
   ...
   end program

Task A execution stream:

program.f:
   ...
   initialize
   ...
   do task a
   ...
   end program

Task B execution stream:

program.f:
   ...
   initialize
   ...
   do task b
   ...
   end program
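A hedged, runnable MPI Fortran version of the pattern above (not from the original slides): the same executable runs on every CPU, and the MPI rank plays the role of TaskID to select task a or task b. The subroutine names are illustrative.

! Hedged sketch: SPMD task selection by process ID.
program spmd_tasks
  use mpi
  implicit none
  integer :: ierr, rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! ... initialize ...
  if (rank == 0) then
     call do_task_a()
  else if (rank == 1) then
     call do_task_b()
  end if

  call MPI_Finalize(ierr)

contains

  subroutine do_task_a()
     print *, 'rank 0: doing task a'
  end subroutine do_task_a

  subroutine do_task_b()
     print *, 'rank 1: doing task b'
  end subroutine do_task_b

end program spmd_tasks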
Multi-Level Task Parallelism
[Figure: four processor sets, each running the same program tskpar; within each process, threads execute parallel blocks (Do loop #1 par block ... End task #1, serial work, Do loop #2 par block ... End task #2), and the processes communicate with one another over the network using MPI between the parallel tasks.]
Implementation: MPI and OpenMP
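A hedged sketch (not the slide's code) of that MPI + OpenMP combination: OpenMP threads parallelize the loop inside each MPI process, and MPI combines the per-process results; MPI calls are made outside the threaded region.

! Hedged sketch of the hybrid model: MPI between processes, OpenMP within each.
program hybrid_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: ierr, rank, nprocs, i
  real(8) :: local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Threaded parallel loop inside each MPI process (the "par block").
  local_sum = 0.0d0
!$omp parallel do reduction(+:local_sum)
  do i = 1, n
     local_sum = local_sum + dble(i + rank)
  end do
!$omp end parallel do

  ! Message-passing step combining the per-process results.
  call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum =', global_sum

  call MPI_Finalize(ierr)
end program hybrid_sketch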
Parallel Application Performance Concepts
• Parallel Speedup
• Parallel Efficiency
• Parallel Overhead
• Limits on Parallel Performance
Parallel Application Performance Concepts
• Parallel Speedup - ratio of best sequential time to parallel
execution time
– S(n) = ts/tp
• Parallel Efficiency - fraction of time processors in use
– E(n) = ts/(tp*n) = S(n)/n
• Parallel Overhead
– Communication time (Message-Passing)
– Process creation/synchronization (MP)
– Extra code to support parallelism, such as Load Balancing
– Thread creation/coordination time (SMP)
• Limits on Parallel Performance
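A small Fortran sketch (not from the slides) applying the two definitions above; the timings are made-up numbers for illustration.

! Hedged sketch: speedup S(n) = ts/tp and efficiency E(n) = S(n)/n.
program speedup_sketch
  implicit none
  real(8) :: ts, tp, s, e
  integer :: n

  ts = 100.0d0       ! best sequential time in seconds (assumed)
  tp = 6.25d0        ! parallel time on n CPUs in seconds (assumed)
  n  = 32

  s = ts / tp
  e = s / dble(n)
  print *, 'speedup =', s, '  efficiency =', e    ! 16.0 and 0.5
end program speedup_sketch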
Limits of Parallel Computing
• Theoretical upper limits
– Amdahl’s Law
– Gustafson’s Law
• Practical limits
– Communication overhead
– Synchronization overhead
– Extra operations necessary for parallel version
• Other Considerations
– Time used to re-write (existing) code
Parallel Computing - Theoretical Performance
Upper Limits
• All parallel programs contain:
– Parallel sections
– Serial sections
Serial sections limit the parallel performance
Amdahl’s Law provides a theoretical upper limit on
parallel performance for size-constant problems
Amdahl’s Law
• Amdahl’s Law places a strict limit on the speedup that can be realized
by using multiple processors
– Effect of multiple processors on run time for size-constant
problems
– Effect of multiple processors on parallel speedup, S:
tN = ( fp / N + fs ) * t1

S = 1 / ( fs + fp / N )

– Where
• fs = serial fraction of code
• fp = parallel fraction of code
• N = number of processors
• t1 = sequential execution time
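A small Fortran sketch (not from the slides) that evaluates the Amdahl speedup formula above for an assumed serial fraction.

! Hedged sketch: Amdahl's Law, S = 1 / (fs + fp/N).
program amdahl_sketch
  implicit none
  real(8) :: fs, fp, s
  integer :: n

  fs = 0.05d0                ! serial fraction (example value)
  fp = 1.0d0 - fs            ! parallel fraction
  do n = 1, 64, 7
     s = 1.0d0 / (fs + fp / dble(n))
     print '(a,i3,a,f7.2)', 'N = ', n, '   speedup = ', s
  end do                     ! speedup approaches 1/fs = 20 as N grows
end program amdahl_sketch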
Amdahl’s Law
[Figure: Amdahl's Law (ideal case) - speedup versus number of CPUs (10 to 40) for serial fractions f = 0.0, 0.01, 0.05, and 0.1.]
Amdahl’s Law (Continued)
[Figure: Amdahl's Law (actual) - speedup versus number of CPUs (10 to 40) for f = 0.0 and f = 0.01, compared with actual measured speedup.]
Gustafson’s Law
Consider scaling the problem size as the processor count is increased:
Ts = serial execution time
Tp(N,W) = parallel execution time for the same problem, size W, on N CPUs
S(N,W) = speedup on problem size W, N CPUs
S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))
Consider the case where Tp(N,W) ~ W*W/N:
S(N,W) = (N*Ts + N*W*W) / (N*Ts + W*W) -> N as W grows
Gustafson's Law provides some hope for parallel applications to deliver on the promise.
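A small Fortran sketch (not from the slides) of the scaled-speedup argument above, using illustrative values for Ts and W.

! Hedged sketch: scaled speedup with Tp(N,W) = W*W/N.
program gustafson_sketch
  implicit none
  real(8) :: ts, w, s
  integer :: n

  ts = 10.0d0                 ! serial part in seconds (assumed)
  w  = 1000.0d0               ! problem "size" parameter (assumed)
  do n = 1, 64, 7
     ! S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))
     s = (ts + w*w) / (ts + w*w/dble(n))
     print '(a,i3,a,f8.2)', 'N = ', n, '   scaled speedup = ', s
  end do                      ! approaches N when W*W dominates Ts
end program gustafson_sketch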
Parallel Programming Analysis - Example
Consider solving 2-D Poisson’s equation by iterative method on a regular
grid with M points –
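The slide's analysis itself is not reproduced here, but as a hedged illustration of the setup it names, this Fortran sketch performs plain Jacobi sweeps for the 2-D Poisson equation on a regular grid; the grid size, source term, and sweep count are illustrative assumptions.

! Hedged sketch: Jacobi iteration for del^2 u = rho on an M x M interior grid.
program jacobi_sketch
  implicit none
  integer, parameter :: m = 64              ! interior grid points per side
  real(8), parameter :: h = 1.0d0 / (m+1)   ! grid spacing
  real(8) :: u(0:m+1,0:m+1), unew(0:m+1,0:m+1), rho(0:m+1,0:m+1)
  integer :: i, j, it

  u   = 0.0d0                               ! zero boundary and initial guess
  rho = 1.0d0                               ! constant source term (assumed)

  do it = 1, 100                            ! fixed sweep count for brevity
     do j = 1, m
        do i = 1, m
           unew(i,j) = 0.25d0 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                                  - h*h*rho(i,j) )
        end do
     end do
     u(1:m,1:m) = unew(1:m,1:m)
  end do

  print *, 'u at center after 100 sweeps:', u(m/2, m/2)
end program jacobi_sketch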
Parallel Programming Concepts
Program must be correct and terminate for some input data
set(s)
• Race condition – result(s) depend upon the order in which processes/threads finish their calculation(s). May or may not be a problem, depending upon the results
• Deadlock – a process/thread requests a resource it will never get. To be avoided – a common problem in message-passing parallel programs
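A hedged illustration (not from the slides) of the deadlock hazard just described: if two ranks both issue a blocking receive before their send, neither call can ever complete.

! Hedged sketch: a classic message-passing deadlock between two ranks.
program deadlock_sketch
  use mpi
  implicit none
  integer :: ierr, rank, other, status(MPI_STATUS_SIZE)
  real(8) :: sendbuf(1), recvbuf(1)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  other = 1 - rank                    ! assumes the job runs on exactly two ranks
  sendbuf(1) = dble(rank)

  ! DEADLOCK: both ranks block here waiting for a message the other has not
  ! sent yet. Reordering send/receive on one rank, or using MPI_Sendrecv,
  ! avoids the problem.
  call MPI_Recv(recvbuf, 1, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, &
                status, ierr)
  call MPI_Send(sendbuf, 1, MPI_DOUBLE_PRECISION, other, 0, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program deadlock_sketch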
Other Considerations
• Writing efficient parallel applications is usually more difficult than
writing serial applications
– Serial version may (may not) provide good starting point for
parallel version
– Communication, synchronization, etc., can limit parallel
performance
• Usually want to overlap communication and computation to minimize ratio of
communication to computation time
– Serial time can dominate
– CPU computational load balance is important
• Is it worth your time to rewrite existing application? Or create new
one? Recall Cherri Pancake’s Rules (simplified version).
– Do the CPU and/or memory requirements justify parallelization?
– Will the code be used “enough” times to justify parallelization?
Parallel Programming - Real Life
• These are the main models in use today (circa 2002)
• New approaches – languages, hardware, etc., are likely to arise as
technology advances
• Other combinations of these models are possible
• Large applications will probably use more than one model
• Shared memory model is closest to mathematical model of application
– Scaling to large numbers of cpus is major issue
Parallel Computing
References
• NPACI PCOMP web-page - www.npaci.edu/PCOMP
– Selected HPC link collection - categorized, updated
• Online Tutorials, Books
– Designing and Building Parallel Programs, Ian Foster.
http://www-unix.mcs.anl.gov/dbpp/
– NCSA Intro to MPI Tutorial
http://pacont.ncsa.uiuc.edu:8900/public/MPI/index.html
– HLRS Parallel Programming Workshop
http://www.hlrs.de/organization/par/par_prog_ws/
• Books
– Parallel Programming, B. Wilkinson, M. Allen
– Computer Organization and Design, D. Patterson and J. L.
Hennessy
– Scalable Parallel Computing, K. Hwang, Z. Xu