Parallel Programming
Foundation to Parallel Programming
CONTENT
• Introduction to parallel programming
• Parallel programming models
• Parallel programming paradigms
Parallel Programming is a Complex Task
• Problems facing parallel software developers:
– Non-determinism
– Communication
– Synchronization
– Partitioning and distribution
– Load balancing
– Fault tolerance
– Race conditions
– Deadlock
– ...
Levels of Parallelism
[Figure: tasks i-1, i, i+1 each run a function (func1, func2, func3) whose loop bodies contain statements such as a(0)=.., b(0)=.., down to individual instructions (+, x, Load); the mechanisms that exploit each level are PVM/MPI, Threads, Compilers, and the CPU.]
Code granularity / code item:
• Large grain (task level): program – PVM/MPI
• Medium grain (control level): function (thread) – Threads
• Fine grain (data level): loop – Compilers
• Very fine grain (multiple issue): with hardware – CPU
Responsible for Parallelization
Grain Size   Code Item                                 Parallelised by
Very Fine    Instruction                               Processor
Fine         Loop / instruction block                  Compiler
Medium       (Standard one-page) function              Programmer
Large        Program / separate heavy-weight process   Programmer
Parallelization Procedure
• Decomposition: sequential computation is broken into tasks
• Assignment: tasks are assigned to process elements
• Orchestration: process elements are organized to communicate and synchronize
• Mapping: process elements are mapped onto processors
Sample Sequential Program
FDM (Finite Difference Method)
…
loop{
  for (i=0; i<N; i++){
    for (j=0; j<N; j++){
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
…
Parallelize the Sequential Program
• Decomposition
…
loop{
  for (i=0; i<N; i++){
    for (j=0; j<N; j++){
      a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                + a[i-1][j] + a[i+1][j] + a[i][j]);
    }
  }
}
…
Each iteration (or block of iterations) of the loop nest forms a task.
Parallelize the Sequential Program
• Assignment
Divide the tasks equally among process elements (PEs).
Parallelize the Sequential Program
• Orchestration
Process elements need to communicate and to synchronize.
Parallelize the Sequential Program
• Mapping
Map the process elements onto the processors of the multiprocessor.
Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space Model)
  • DSM
  • Threads/OpenMP (enabled for clusters)
  • Cilk
  • Java threads
• Message Passing Model
  • PVM
  • MPI
• Functional Programming
  • MapReduce
Parallel Programming Models
• Partitioned Global Address Space (PGAS) Languages
  • UPC, Coarray Fortran, Titanium
• Languages and Paradigms for Hardware Accelerators
  • CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends:
• Scalar application → vector
• Distributed memory – MPP systems, message passing: MPI
• Shared memory – multi-core nodes: OpenMP, …
• Accelerators (GPGPU, FPGA): CUDA, OpenCL, …
• Hybrid codes
Sequential Programming Model
• Functional
  • Naming: can name any variable in the virtual address space
  • Hardware (and perhaps the compiler) does translation to physical addresses
  • Operations: loads and stores
  • Ordering: sequential program order
Sequential Programming Model
• Performance
  • Rely (mostly) on dependences on single locations: dependence order
  • Compiler: reordering and register allocation
  • Hardware: out-of-order execution, pipeline bypassing, write buffers
  • Transparent replication in caches
SAS (Shared Address Space)
Programming Model
[Figure: two threads (processes) in one system perform read(X) and write(X) on a shared variable X in the shared address space.]
Shared Address Space Programming Model
• Naming
  • Any process can name any variable in the shared space
• Operations
  • Loads and stores, plus those needed for ordering
• Simplest Ordering Model
  • Within a process/thread: sequential program order
  • Across threads: some interleaving (as in time-sharing)
  • Additional orders through synchronization
Synchronization
• Mutual exclusion (locks)
• No ordering guarantees
• Event synchronization
• Ordering of events to preserve dependences
• e.g. producer → consumer of data
MP Programming Model
[Figure: a process on Node A executes send(Y); a process on Node B executes receive(Y'); the message carries Y into Y'.]
Message-Passing Programming Model
[Figure: Send X, Q, t at address X in process P's local address space matches Receive Y, P, t at address Y in process Q's local address space.]
• Send specifies the data buffer to be transmitted and the receiving process
• Recv specifies the sending process and the storage into which the received data is placed
• A user process can name only local variables and entities in its own address space
• There are many overheads: copying, buffer management, protection
Message Passing Programming Model
• Naming
  – Processes can name local variables directly
  – There is no shared address space
• Operations
  – Explicit communication: send and receive
  – Send transfers data from the private address space to another process
  – Receive copies data into the private address space
  – Must be able to name processes
Message Passing Programming Model
• Ordering
  • Within a process, order is determined by the program
  • Send and receive provide point-to-point synchronization between processes
  • A global address space can be constructed
    • e.g. process id + address within the process address space
    • but there are no direct operations on it
Functional Programming
• Functional operations do not modify data structures; they create new ones
• The original data always remain unchanged
• Data flow is not specified explicitly in the program
• The order of operations does not matter
Functional Programming
fun foo(l: int list) =
sum(l) + mul(l) + length(l)
The order of sum() and mul(), etc. does not matter – they do not modify l.
GPU
• Graphical Processing Unit
• A GPU consists of a large number of cores, e.g. hundreds
• A typical CPU, by contrast, has 2, 4, 8 or 12 cores
• Cores? – processing units on a chip that share at least memory or the L1 cache
• General-purpose computation using the GPU in applications other than 3D graphics
• The GPU accelerates the critical path of the application
CPU v/s GPU
GPU and CPU
• Typically the GPU and CPU coexist in a heterogeneous setting
• The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
• NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture, accompanied by the CUDA programming model and the CUDA C language
What is CUDA?
CUDA: Compute Unified Device Architecture.
A parallel computing architecture developed by NVIDIA – the computing engine in the GPU.
CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs.
Processing Flow
The CUDA processing flow:
1. Copy data from main memory to GPU memory
2. The CPU launches the compute kernel on the GPU
3. The GPU executes it in parallel on each core
4. Copy the results from GPU memory back to main memory
CUDA Programming Model
Definitions:
• Device = GPU
• Host = CPU
• Kernel = function that runs on the device
CUDA Programming Model
A kernel is executed by a grid of thread blocks.
A thread block is a batch of threads that can cooperate with each other by:
• Sharing data through shared memory
• Synchronizing their execution
Threads from different blocks cannot cooperate.
CUDA Kernels and Threads
Parallel portions of an application are executed on the device as kernels.
One kernel is executed at a time; many threads execute each kernel.
Differences between CUDA threads and CPU threads:
• CUDA threads are extremely lightweight: very little creation overhead, instant switching
• CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads.
All threads run the same code.
Each thread has an ID that it uses to compute memory addresses and make control decisions.
Partitioned Global Address Space
• Most parallel programs are written using either:
  • Message passing with an SPMD model (MPI)
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in OpenMP, threads+C/C++/Fortran, or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) languages take the best of both
  • SPMD parallelism like MPI
  • Local/global distinction, i.e., layout matters
  • Global address space like threads (programmability)
How does PGAS compare to other models?
[Figure: processes/threads vs. address spaces – message passing (MPI): one address space per process; shared memory (OpenMP): all threads in one address space; PGAS (UPC, CAF, X10): threads over a partitioned global address space.]
• Computation is performed in multiple places.
• A place contains data that can be operated on remotely.
• Data live throughout their lifetime in the place where they were created.
• Data in one place may point to data in another place.
• Data structures (e.g. arrays) may be distributed across many places.
A place expresses locality.
PGAS Overview
• "Partitioned Global View" (or PGAS)
  • Global Address Space: every thread can see all of the data, so there is no need to replicate it
  • Partitioned: the global address space is divided, so the programmer is aware of data sharing among threads
• Implementations
  • GA Library from PNNL
  • Unified Parallel C (UPC), Fortran 2008 coarrays
  • X10, Chapel
• Concepts
  • Memories and structures
  • Partition and mapping
  • Threads and affinity
  • Local and non-local accesses
  • Collective operations and "owner computes"
Memories and Distributions
• Software Memory
  • Distinct logical storage area in a computer program (e.g., heap or stack)
  • For parallel software, we use multiple memories
• Structure
  • Collection of data created by program execution (arrays, trees, graphs, etc.)
• Partition
  • Division of a structure into parts
• Mapping
  • Assignment of structure parts to memories
Software Memory Examples
• Executable image at right: "program linked, loaded and ready to run"
• Memories
  • Static memory: data segment
  • Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
  • Stack memory: holds function call records; implicitly managed by the runtime during execution
Affinity and Nonlocal Access
• Affinity is the association between a thread and a memory
  • If a thread has affinity with a memory, it can access that memory's structures
  • Such memory is called local memory
• Nonlocal access
  • Thread 0 needs part B
  • Part B is in Memory 1
  • Thread 0 has no affinity with Memory 1
• A nonlocal access is usually implemented by communication between processes, and is therefore expensive
Threads and Memories for Different Programming Methods

Method        Thread Count            Memory Count        Nonlocal Access
Sequential    1                       1                   N/A
OpenMP        Either 1 or p           1                   N/A
MPI           p                       p                   No. Message required.
CUDA          1 (host) + p (device)   2 (host + device)   No. DMA required.
UPC, FORTRAN  p                       p                   Supported.
X10           n                       p                   Supported.
Hybrid (MPI+OpenMP+CUDA+…)
• Takes the positives of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia
• Hard to rewrite millions of source lines
Hybrid parallel programming
• Python: ensemble simulations
• MPI: domain partitioning
• OpenMP: outer-loop partitioning
• CUDA: assign inner-loop iterations to GPU threads
Design Issues Apply at All Layers
• The programming model's position provides constraints/goals for the system
• In fact, each interface between layers supports or takes a position on:
  – Naming model
  – Set of operations on names
  – Ordering model
  – Replication
  – Communication performance
Naming and Operations
Naming and operations in the programming model can be directly supported by lower levels, or translated by the compiler, libraries, or OS.
Example: shared virtual address space in the programming model.
• Hardware interface supports a shared physical address space
• Direct support by hardware through v-to-p mappings, no software layers
Naming and Operations (Cont’d)
• Hardware supports independent physical address spaces
• The system/user interface can provide SAS through the OS
  • v-to-p mappings only for data that are local
  • Remote data accesses incur page faults; the data are brought in via page-fault handlers
• Or through compilers or the runtime, above the system/user interface
Naming and Operations (Cont’d)
Example: implementing message passing.
• Direct support at the hardware interface
• Support at the system/user interface or above, in software (almost always)
  • The hardware interface provides basic data transport
  • Send/receive are built in software for flexibility (protection, buffering)
• Or the lower interfaces provide SAS, and send/receive are built on top with buffers and loads/stores
Naming and Operations (Cont’d)
• Need to examine the issues and tradeoffs at every layer
  • Frequencies and types of operations, costs
• Message passing
  • No assumptions on orders across processes except those imposed by send/receive pairs
• SAS
  • How processes see the order of other processes' references defines the semantics of SAS
  • Ordering is very important and subtle
Ordering model
• Uniprocessors play tricks with orders to gain parallelism or locality
• These tricks matter even more in multiprocessors
• Need to understand which old tricks are still valid, and learn new ones
• How programs behave, what they rely on, and the hardware implications
Parallelization Paradigms
• Task-Farming/Master-Worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation.
Master Worker/Slave Model
• The master decomposes the problem into small tasks, distributes the tasks to workers for execution, and then collects the results to form the final result.
• Mapping / load balancing:
  • Static
  • Dynamic
Single-Program Multiple-Data
• Every process executes the same code but processes different data.
• Domain decomposition, data parallelism.
Pipelining
• Suitable for:
  • Fine-grained parallelism
  • Applications that execute in multiple stages
Divide and Conquer
• The problem is decomposed into several subproblems; each subproblem is solved independently, and the results are merged.
• Three operations: split, compute, and join.
• Master-worker/task-farming is similar to divide and conquer: the master runs the split and join operations.
• Formally similar to hierarchical master-worker.
Speculative Parallelism
• Suitable when complex dependences exist between problems
• Uses "look-ahead" execution
• Uses multiple algorithms to solve the same problem