Load Balancing

Lecture 8:
Chapter 7
Load Balancing and Termination Detection
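Load balancing: distributing computation evenly across the processors to obtain the highest possible execution speed.
Termination detection: detecting when a computation has been completed. Detection is more difficult when the computation is distributed.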
Load Balancing
Static Load Balancing
The scheduling problem
Static load balancing can be attempted before any process executes.
Some potential static load balancing techniques:
• Round robin algorithm — passes out tasks in sequential
order of processes coming back to the first when all
processes have been given a task
• Randomized algorithms — selects processes at random
to take tasks
• Recursive bisection — recursively divides the problem
into sub-problems of equal computational effort while
minimizing message passing
• Simulated annealing — an optimization technique
• Genetic algorithm — another optimization technique
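As a minimal sketch of the round robin technique in the list above, the C function below gives task t to process t mod n and records the resulting load per process. The name assign_round_robin and the cost[] array of estimated task costs are assumptions made for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: static round robin assignment.
 * Task t goes to process t % nprocs. owner[t] records the assignment and
 * load[p] accumulates the estimated cost of the tasks given to process p. */
void assign_round_robin(const double *cost, size_t ntasks,
                        size_t nprocs, size_t *owner, double *load)
{
    assert(nprocs > 0);
    for (size_t p = 0; p < nprocs; p++)
        load[p] = 0.0;
    for (size_t t = 0; t < ntasks; t++) {
        owner[t] = t % nprocs;        /* wrap back to the first process */
        load[owner[t]] += cost[t];
    }
}
```

With costs {4, 1, 1, 4, 1, 1} and three processes, every process receives two tasks, but process 0 carries a load of 8 while the others carry 2 each. Equal task counts do not guarantee equal work when the task costs differ.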
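Several fundamental flaws of static load balancing:
• It is very difficult to estimate the execution times of the various parts of a program without actually executing them.
• Communication delays vary under different circumstances.
• Some problems have an indeterminate number of steps to reach their solution.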
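Dynamic Load Balancing
Run-time scheduling
Dynamic load balancing is applied while the processes are executing.
It takes the factors listed above into account by dividing the load according to the parts of the program that are actually running.
It incurs additional run-time overhead, but is generally more effective than static load balancing.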
Processes and Processors
Computation will be divided into work or
tasks to be performed, and processes
perform these tasks. Processes are mapped onto
processors.
Since our objective is to keep the processors busy,
we are interested in the activity of the processors.
Dynamic Load Balancing
Can be classified as:
• Centralized
• Decentralized
Centralized dynamic load balancing
Tasks are handed out from a centralized location, in a master-slave structure.
Decentralized dynamic load balancing
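Tasks are passed between arbitrary processes.
A collection of worker processes operates on the problem. The workers interact among themselves and finally report to a single process.
A worker process may receive tasks from other worker processes and may send tasks to other worker processes.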
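Centralized Dynamic Load Balancing
The master process holds the collection of tasks to be performed.
Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
Terms used: work pool, replicated worker, processor farm.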
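Centralized work pool
Termination
The computation terminates when both of the following are satisfied:
• The task queue is empty, and
• Every slave process is idle and has made a request for another task without any new tasks being generated.
An empty task queue alone is not sufficient for termination: if any processes are still running, they may generate new tasks for the queue.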
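The two termination conditions can be sketched as a single predicate evaluated by the master. The WorkPool structure and its field names are hypothetical bookkeeping chosen for this illustration, not part of the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical master-side bookkeeping for a centralized work pool. */
typedef struct {
    size_t queued_tasks;    /* tasks waiting in the master's queue       */
    size_t nslaves;         /* number of slave processes                 */
    size_t waiting_slaves;  /* slaves that are idle and have an
                               outstanding request for another task      */
} WorkPool;

/* True only when both conditions hold: the queue is empty and every
 * slave is idle with a pending request, so no running slave can still
 * generate new tasks. */
bool pool_terminated(const WorkPool *wp)
{
    assert(wp->waiting_slaves <= wp->nslaves);
    return wp->queued_tasks == 0 && wp->waiting_slaves == wp->nslaves;
}
```

A pool with an empty queue but only three of four slaves waiting is not finished, because the fourth slave may still add tasks.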
Decentralized Dynamic Load Balancing
Distributed work pool
Fully Distributed Work Pool
Processes execute tasks received from one another.
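Task Transfer Mechanisms
Receiver-Initiated Method
A process requests tasks from other processes that it selects.
Typically, a process requests tasks from other processes when it has few or no tasks to perform.
This method works well at high system load.
Drawback: determining process loads can be expensive.
Sender-Initiated Method
A process sends tasks to other processes that it selects.
Typically, a process with a heavy load passes some of its tasks to other processes.
This method works well at light overall system load.
Another option is to combine both methods: lightly loaded processes request tasks from heavily loaded processes, and heavily loaded processes send tasks to lightly loaded processes. Drawback: determining process loads can be expensive.
Note: under very heavy system load, load balancing can also be difficult to achieve because of the lack of available processes.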
Decentralized selection algorithm requesting tasks between slaves
Local selection algorithm
Process Selection
Algorithms for selecting a process:
Round robin algorithm: process Pi requests tasks from
process Px, where x is given by a counter that is
incremented after each request, using modulo n
arithmetic (n processes), excluding x = i.
Random polling algorithm: process Pi requests tasks
from process Px, where x is a number selected
randomly between 0 and n - 1 (excluding i).
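The two selection rules might be coded as follows. This is a sketch under stated assumptions: the function names are invented here, the round robin counter is assumed to start somewhere in 0..n-1, and rand() is used for brevity even though its modulo bias makes the choice only approximately uniform.

```c
#include <assert.h>
#include <stdlib.h>

/* Round robin: advance a per-process counter modulo n, skipping i itself.
 * *counter holds the last process asked and is updated in place. */
int next_round_robin(int i, int n, int *counter)
{
    assert(n > 1 && i >= 0 && i < n);
    assert(*counter >= 0 && *counter < n);
    do {
        *counter = (*counter + 1) % n;
    } while (*counter == i);
    return *counter;
}

/* Random polling: pick x from 0..n-1 excluding i.
 * Draw from n-1 candidates and shift those at or above i up by one. */
int next_random_poll(int i, int n)
{
    assert(n > 1 && i >= 0 && i < n);
    int x = rand() % (n - 1);
    return (x >= i) ? x + 1 : x;
}
```

For process P2 in a system of four processes with the counter starting at 0, successive round robin requests go to P1, P3, P0, and then P1 again.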
Load Balancing Using a Line Structure
[Figure: task queue]
The master process (P0) feeds the queue with tasks at one
end, and the tasks are shifted down the queue.
When a process Pi (1 <= i < n) detects a task at its
input from the queue and the process is idle, it takes the
task from the queue.
The tasks to the left then shuffle down the queue so that the
space held by the task is filled. A new task is inserted into the
left-side end of the queue.
Eventually, all processes have a task and the queue is
filled with new tasks. High-priority or larger tasks
could be placed in the queue first.
Shifting Actions
Could be orchestrated by using messages between
adjacent processes:
• For left and right communication
• For the current task
Code Using Time Sharing Between
Communication and Computation
Master process (P0)
Process Pi (1 < i < n)
A nonblocking nrecv() is necessary to check for a request being
received from the right.
Nonblocking Receive Routines
MPI
Nonblocking receive, MPI_Irecv(), returns a
request “handle,” which is used in subsequent
completion routines to wait for the message or to
establish whether message has actually been
received at that point (MPI_Wait() and MPI_Test(),
respectively).
In effect, nonblocking receive, MPI_Irecv(), posts a
request for message and returns immediately.
Load Balancing Using a Tree Structure
Tasks are passed from a node into one of the two nodes below it
when a node buffer becomes empty.
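Distributed Termination Detection Algorithms
Termination Conditions
At time t, distributed termination requires the following conditions to be satisfied:
• Application-specific local termination conditions exist
throughout the collection of processes, at time t.
• There are no messages in transit between processes at time t.
The second condition is necessary because a message in transit might restart a process that has already terminated.
The second condition is harder to recognize, because the time that messages take to travel between processes is not known in advance.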
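A General Distributed Termination Algorithm
Each process is in one of two states:
1. Inactive: without any task to perform
2. Active
The process that sent the task that made a process enter
the active state becomes its “parent.”
1. At the start, a process is in the inactive state.
2. On receiving its first task, the process becomes active, and the process that sent the task becomes its “parent.”
3. On receiving a second or later task, the process immediately sends an acknowledgment message for it.
The process sends an acknowledgment to its “parent” only when it is ready to become inactive. It may become inactive only when:
• Its local termination condition exists (all of its assigned tasks are completed),
• It has sent acknowledgments for all the tasks it has received, and
• It has received acknowledgments for all the tasks it has sent out.
A process must become inactive before its parent process does. When the first process becomes idle, the computation can terminate.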
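The per-process bookkeeping for this acknowledgment scheme can be sketched as a small state record. The Proc structure, its field names, and the helper functions are assumptions introduced for illustration; message transport is left out, and the caller is expected to send the acknowledgments that the functions ask for.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PARENT (-1)

typedef struct {
    bool active;        /* false = inactive (no task to perform)          */
    int  parent;        /* process whose task activated us, or NO_PARENT  */
    int  unacked_sent;  /* tasks we sent that are not yet acknowledged    */
    int  pending_work;  /* tasks received but not yet completed           */
} Proc;

/* Returns true if the caller must acknowledge `from` immediately;
 * returns false if `from` became the parent and its ack is deferred. */
bool on_task_received(Proc *p, int from)
{
    p->pending_work++;
    if (!p->active) {
        p->active = true;
        p->parent = from;
        return false;
    }
    return true;
}

void on_task_sent(Proc *p)     { p->unacked_sent++; }
void on_ack_received(Proc *p)  { assert(p->unacked_sent > 0); p->unacked_sent--; }
void on_task_finished(Proc *p) { assert(p->pending_work > 0); p->pending_work--; }

/* If every condition holds, become inactive, store the parent that must
 * now receive the deferred acknowledgment in *ack_to, and return true. */
bool try_become_inactive(Proc *p, int *ack_to)
{
    if (!p->active || p->pending_work > 0 || p->unacked_sent > 0)
        return false;
    *ack_to = p->parent;
    p->active = false;
    p->parent = NO_PARENT;
    return true;
}
```

Because the acknowledgment to the parent is withheld until the process has collected all of its own acknowledgments, inactivity propagates from the leaves of the activation tree back toward the process that started the computation.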
Termination using message acknowledgments
Ring Termination Algorithms
Single-pass ring termination algorithm
1. When P0 has terminated, it generates a token that is passed to P1.
2. When Pi (1 <= i < n) receives the token and has already terminated,
it passes the token onward to Pi+1. Otherwise, it waits for its local
termination condition and then passes the token onward. Pn-1
passes the token to P0.
3. When P0 receives a token, it knows that all processes in the ring
have terminated. A message can then be sent to all processes
informing them of global termination, if necessary.
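To see when P0 learns of global termination under the three steps above, the token's journey can be simulated. This is only a timing sketch: the term[] array of local termination times and the fixed per-hop delay are hypothetical inputs, not part of the algorithm itself.

```c
#include <assert.h>
#include <stddef.h>

/* term[i] is the (hypothetical) time at which Pi reaches its local
 * termination condition; hop is the time one token transfer takes.
 * Returns the time at which the token arrives back at P0. */
double token_return_time(const double *term, size_t n, double hop)
{
    assert(n > 0 && hop >= 0.0);
    double t = term[0];            /* P0 creates the token on terminating  */
    for (size_t i = 1; i < n; i++) {
        t += hop;                  /* token travels from Pi-1 to Pi        */
        if (term[i] > t)
            t = term[i];           /* Pi holds the token until it is done  */
    }
    return t + hop;                /* Pn-1 passes the token back to P0     */
}
```

With termination times {1, 5, 2, 3} and a hop time of 1, the token waits at P1 until time 5 and returns to P0 at time 8.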
[Figure: ring termination detection algorithm, showing the token being passed around the ring]
The algorithm assumes that a process cannot be reactivated after
reaching its local termination condition. It therefore does not apply to work pool
problems, in which a process can pass a new task to an idle process.
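Dual-Pass Ring Termination Algorithm
The dual-pass ring algorithm can handle processes being reactivated.
Suppose the token has already passed Pj and has now reached Pi (j < i). Pi passes the token onward, but Pi also passes a task back to Pj, which reactivates Pj. If this happens, the token must travel around the ring a second time.
To distinguish this case, tokens are colored white or black, and processes are also colored white or black.
Receiving a black token means that global termination has not yet occurred, and the token must be recirculated around the ring.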
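[Figure: passing a task to a previous process. Labels: P0 sends a white token; white token; black token; a black process becomes white after passing the black token on.]
A black process colors the token black; a white process passes the token on in its original color.
When P0 receives a black token, it sends out a white token. When it receives a white token, it stops: all processes have terminated.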
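The coloring rules can be sketched as a simulation of the token's trips around the ring. Two assumptions are made here that the slides only imply: a process is black when it has passed a task to a lower-numbered process since it last forwarded the token, and every process has reached local termination by the time the token arrives. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { WHITE, BLACK } Color;

/* One circulation of the token: P0 -> P1 -> ... -> Pn-1 -> P0.
 * proc[i] is BLACK if Pi passed a task to some Pj with j < i since it
 * last forwarded the token. Returns the color of the token that comes
 * back to P0. */
Color circulate_token(Color *proc, size_t n)
{
    assert(n > 0);
    Color token = WHITE;            /* P0 always sends out a white token  */
    proc[0] = WHITE;
    for (size_t i = 1; i < n; i++) {
        if (proc[i] == BLACK)
            token = BLACK;          /* a black process blackens the token */
        proc[i] = WHITE;            /* white again after passing it on    */
    }
    return token;
}

/* P0 repeats circulations until a white token returns.
 * Returns the number of passes that were needed. */
int detect_termination(Color *proc, size_t n)
{
    int passes = 1;
    while (circulate_token(proc, n) == BLACK)
        passes++;
    return passes;
}
```

If P2 is black when the first token reaches it, the token returns to P0 black and a second pass is needed; with no black processes, one pass is enough. The sketch does not model reactivated processes sending further tasks, which in a real run could force additional passes.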
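Tree Termination Algorithm
[Figure: tokens being passed up the tree (token 1, token 2)]
If the tree has m leaves, the root must receive m tokens. Termination is then broadcast from the root through the tree to all processes, and only then does the computation terminate globally.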
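Fixed Energy Distributed Termination Algorithm
A fixed quantity of “energy” exists in the system; it acts like a token that carries a numeric value.
• The system starts with all the energy held by the master process.
• The master process passes out portions of the energy, together with tasks, to processes that request tasks.
• If these processes receive requests for tasks, they divide their energy further and pass it on.
• When a process becomes idle, it returns the energy it holds before requesting a new task.
• A process does not hand back its energy until all the energy it handed out has been returned and combined into the total energy it holds.
• When all the energy has been returned to the master process and the master process is idle, all processes must be idle and the computation can terminate.
Significant disadvantage: because energy is divided with finite precision, the partial energies may not sum to the original total.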
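The precision problem is easy to reproduce. The sketch below splits a total energy of 1.0 into equal floating-point shares and adds them back, then repeats the experiment with energy counted in whole units. Both function names are hypothetical, and the floating-point behavior described assumes IEEE 754 double arithmetic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hand out `parts` equal floating-point shares of total energy 1.0 and
 * add the returned shares back up. Returns true if the sum is exactly
 * 1.0 again, i.e. if the master would detect termination. */
bool energy_returns_exactly_double(int parts)
{
    assert(parts > 0);
    double share = 1.0 / parts;
    double returned = 0.0;
    for (int k = 0; k < parts; k++)
        returned += share;
    return returned == 1.0;
}

/* Same experiment with energy counted in whole units. The remainder of
 * the division stays with the master, so nothing is lost. */
bool energy_returns_exactly_units(long total_units, int parts)
{
    assert(parts > 0 && total_units >= 0);
    long share = total_units / parts;
    long kept = total_units - share * parts;
    long returned = kept;
    for (int k = 0; k < parts; k++)
        returned += share;
    return returned == total_units;
}
```

Four shares of 0.25 add back to exactly 1.0, but ten shares of 0.1 do not, so a master comparing against 1.0 would never see termination. Counting energy in integer units is one possible workaround, although repeated division can eventually run out of units to split.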
Load balancing/termination detection
Example
Shortest Path Problem
Finding the shortest distance between two points on a graph.
It can be stated as follows:
Given a set of interconnected nodes where the links
between the nodes are marked with “weights,” find
the path from one specific node to another specific
node that has the smallest accumulated weights.
The interconnected nodes can be described by a graph.
The nodes are called vertices, and the links are called
edges.
If the edges have implied directions (that is, an edge can
only be traversed in one direction), the graph is a directed
graph.
Example:
The Best Way to Climb a Mountain
Graph of mountain climb
Weights in graph indicate amount of effort that would be expended
in traversing the route between two connected camp sites.
The effort in one direction may be different from the effort in the
opposite direction (downhill instead of uphill!). (directed graph)
Graph Representation
Two basic ways that a graph can be represented in a program:
1. Adjacency matrix: a two-dimensional array, a, in
which a[i][j] holds the weight associated with the edge between
vertex i and vertex j if one exists
2. Adjacency list: for each vertex, a list of the vertices
directly connected to the vertex by an edge and the
corresponding weights associated with the edges
The adjacency matrix is used for dense graphs. The adjacency list is used for
sparse graphs.
The choice is based upon space (storage) requirements. Accessing
the adjacency list is slower than accessing the adjacency matrix.
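The two representations might look like this in C. This is a sketch with invented names: the Edge type, the helper functions, and the NO_EDGE sentinel are assumptions for illustration, and the matrix is stored row by row in a flat array.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_EDGE (-1)   /* sentinel for "no edge" in this sketch */

/* Adjacency list node: one outgoing edge of a vertex. */
typedef struct Edge {
    int to;
    int weight;
    struct Edge *next;
} Edge;

/* Prepend edge (from -> to) to the list of `from`. Returns 0 on success. */
int list_add_edge(Edge **adj, int from, int to, int weight)
{
    Edge *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->to = to;
    e->weight = weight;
    e->next = adj[from];
    adj[from] = e;
    return 0;
}

/* Look up a weight in the list: walks the edges of `from`. */
int list_weight(Edge *const *adj, int from, int to)
{
    for (const Edge *e = adj[from]; e != NULL; e = e->next)
        if (e->to == to)
            return e->weight;
    return NO_EDGE;
}

/* Look up a weight in an n x n matrix stored row by row: one index step. */
int matrix_weight(const int *a, int n, int from, int to)
{
    assert(from >= 0 && from < n && to >= 0 && to < n);
    return a[from * n + to];
}
```

The matrix always needs n × n entries but answers a lookup with one index computation. The list stores only the edges that exist, at the cost of walking a vertex's list on each lookup.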
Representing the graph
Searching a Graph
Two well-known single-source shortest-path algorithms:
• Moore’s single-source shortest-path algorithm (Moore,
1957)
• Dijkstra’s single-source shortest-path algorithm (Dijkstra,
1959)
which are similar.
Moore’s algorithm is chosen because it is more amenable to
parallel implementation although it may do more work.
The weights must all be positive values for the algorithm to
work.
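Moore’s Algorithm
The algorithm is based on a dynamic programming idea.
It is iterative. Starting from the source vertex, when vertex i is considered in the current iteration, the algorithm finds the shortest distance from the source vertex to each other vertex by way of vertex i.
Let d_j be the current shortest distance from the source vertex to vertex j, and let w_{i,j} be the weight of the edge from vertex i to vertex j.
The shortest distance is updated as follows:
d_j = min(d_j, d_i + w_{i,j})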
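Moore’s Shortest-path Algorithm
[Figure: source vertex, vertex i, vertex j, and destination vertex, with edge weight w_{i,j} and distance d_j; initial queue and current queue]
In each iteration, every vertex j whose distance is replaced must be added to the queue.
The destination vertex is never placed in the queue.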
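Data Structures
Because the search is directed, a first-in-first-out queue (the vertex queue) is created to hold the vertices to be examined. Initially, only the source vertex is in the queue.
The current shortest distance from the source vertex to vertex i is kept in the array element dist[i]. Initially, these distances are set to “infinity.”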
Code
Suppose w[i][j] holds the weight of the edge from vertex i
to vertex j (infinity if there is no edge). The code could be of the
form
newdist_j = dist[i] + w[i][j];
if (newdist_j < dist[j]) dist[j] = newdist_j;
When a shorter distance is found to vertex j, vertex j is
added to the queue (if not already in the queue), which will
cause vertex j to be examined again.
Stages in Searching a Graph
Example
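Initially, the vertex queue holds only the source vertex A, and the distance to every other vertex is infinity.
Examine each edge leaving vertex A: AB.
A is removed from the queue, the shortest distance from A to B is updated (AB = 10), and B is added to the queue.
Paths so far: AB
Examine the edges leaving B: BF, BE, BD, BC.
B is removed from the queue and the shortest distances from A to F, E, D, and C are updated. E, D, and C are added to the queue; the destination F is not.
AF = min(AF, AB + BF) = min(∞, 10 + 51) = 61
AE = min(AE, AB + BE) = min(∞, 10 + 24) = 34
AD = min(AD, AB + BD) = min(∞, 10 + 13) = 23
AC = min(AC, AB + BC) = min(∞, 10 + 8) = 18
Paths so far: AB, ABC, ABD, ABE, ABF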
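Examine all edges leaving E: EF.
E is removed from the queue and the shortest distance from A to F is updated. The destination F is not added to the queue.
AF = min(AF, AE + EF) = min(61, 34 + 17) = 51
Paths so far: AB, ABC, ABD, ABE, ABEF
Examine all edges leaving D: DE.
D is removed from the queue, the shortest distance from A to E is recomputed and updated, and E is added to the queue again.
AE = min(AE, AD + DE) = min(34, 23 + 9) = 32
Paths so far: AB, ABC, ABD, ABDE, ABEF
Examine all edges leaving C: CD.
AD = min(AD, AC + CD) = min(23, 18 + 14) = 23
C is removed from the queue, but AD keeps its previous value, so D is not added to the queue.
Examine the edge EF leaving E once more. E is removed from the queue and AF is updated:
AF = min(AF, AE + EF) = min(51, 32 + 17) = 49
Paths so far: AB, ABC, ABD, ABDE, ABDEF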
There are no more vertices to consider. We now have the minimum
distance from vertex A to each of the other vertices, including the
destination vertex, F.
Usually, the path is required in addition to the distance. The path is
then stored as the distances are recorded. The path in our case is
A -> B -> D -> E -> F.
Sequential Code
Let next_vertex() return the next vertex from the vertex
queue, or no_vertex if there is none.
Assume that an adjacency matrix, named w[ ][ ], is used.
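A possible sequential version under these assumptions is sketched below. It is not the slides' own listing: the circular queue, the in_queue[] flags, the constant NO_VERTEX (standing in for no_vertex), and the use of INT_MAX as infinity are choices made for this illustration. The sketch also does not treat the destination specially, so any vertex whose distance improves is queued.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define N 6                 /* number of vertices (A..F in the example)  */
#define INF INT_MAX         /* "infinity": no edge or distance unknown   */
#define NO_VERTEX (-1)

static int  vertex_queue[N];         /* circular FIFO of vertices        */
static bool in_queue[N];
static int  q_head, q_count;

static void append_queue(int v)
{
    if (in_queue[v])
        return;                      /* already waiting to be examined   */
    vertex_queue[(q_head + q_count) % N] = v;
    q_count++;
    in_queue[v] = true;
}

static int next_vertex(void)
{
    if (q_count == 0)
        return NO_VERTEX;
    int v = vertex_queue[q_head];
    q_head = (q_head + 1) % N;
    q_count--;
    in_queue[v] = false;
    return v;
}

/* Moore's algorithm: fills dist[] with shortest distances from source. */
void moore(int w[N][N], int source, int dist[N])
{
    assert(source >= 0 && source < N);
    for (int v = 0; v < N; v++) {
        dist[v] = INF;
        in_queue[v] = false;
    }
    q_head = q_count = 0;
    dist[source] = 0;
    append_queue(source);

    int i;
    while ((i = next_vertex()) != NO_VERTEX) {
        for (int j = 0; j < N; j++) {
            if (w[i][j] == INF)
                continue;            /* no edge from i to j              */
            int newdist_j = dist[i] + w[i][j];
            if (newdist_j < dist[j]) {
                dist[j] = newdist_j;
                append_queue(j);     /* j must be examined again         */
            }
        }
    }
}
```

Numbering the vertices A to F as 0 to 5 and using the edge weights that appear in the worked example's arithmetic (AB = 10, BC = 8, BD = 13, BE = 24, BF = 51, CD = 14, DE = 9, EF = 17), moore(w, 0, dist) leaves dist[5] equal to 49, matching the walk-through.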
Parallel Implementations
Centralized Work Pool
The centralized work pool holds the vertex queue, vertex_queue[],
as tasks.
Each slave takes vertices from the vertex queue and returns
new vertices.
Since the structure holding the graph weights is fixed, this
structure could be copied into each slave, say as a copied
adjacency matrix.
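Decentralized Work Pool
The task queue vertex_queue[] and the array dist[] can be distributed. Each process i corresponds to one vertex i. The process holds the vertex queue entry for that vertex (the neighbors reached from that vertex) and the current shortest distance from the source vertex to vertex i.
Each process stores row w[i][] of the adjacency matrix in order to find its neighbors and the edge weights.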
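Search Algorithm
Suppose process i holds the current shortest distance dist. For each edge from i to j, it computes dj = di + w[i][j] and sends dj to process j.
When process j receives dj, it compares dj with its own shortest distance dist. If dj < dist, it updates dist = dj, and for every edge leaving j (jk, jl, jh, ...) it computes dk, dl, dh, ... and sends them to processes k, l, h, ....
If dj >= dist, no update is made.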
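The message-driven search can be imitated in a single program by treating each vertex as a simulated process and keeping the messages in transit in a FIFO mailbox. This is only a sketch: real processes would exchange messages concurrently, and the Msg type, the fixed-size mailbox, and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>

#define NV 4                  /* vertices, one simulated process each   */
#define INF INT_MAX
#define MAX_MSGS 64           /* mailbox capacity for this small sketch */

typedef struct { int to; int d; } Msg;

static Msg mailbox[MAX_MSGS]; /* messages in transit, in FIFO order     */
static int head, tail;

static void send_dist(int to, int d)
{
    assert(tail < MAX_MSGS);
    mailbox[tail].to = to;
    mailbox[tail].d = d;
    tail++;
}

/* What process j does when a candidate distance d arrives. */
static void on_receive(int w[NV][NV], int dist[NV], int j, int d)
{
    if (d >= dist[j])
        return;                        /* not shorter: no update        */
    dist[j] = d;
    for (int k = 0; k < NV; k++)
        if (w[j][k] != INF)
            send_dist(k, d + w[j][k]); /* new candidate for neighbor k  */
}

/* Drive the simulation until no message is left in transit. */
void distributed_search(int w[NV][NV], int source, int dist[NV])
{
    for (int v = 0; v < NV; v++)
        dist[v] = INF;
    head = tail = 0;
    send_dist(source, 0);              /* start the source at distance 0 */
    while (head < tail) {
        Msg m = mailbox[head++];
        on_receive(w, dist, m.to, m.d);
    }
}
```

The loop ends only when the mailbox is empty, which corresponds to the requirement that no messages are in transit. In a truly distributed setting that condition cannot be read off a single variable, which is why a termination detection algorithm is needed.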
Distributed graph search
A mechanism is necessary to repeat the actions and to terminate
when all processes are idle; it must cope with messages in
transit.
Simplest solution
Use synchronous message passing, in which a process
cannot proceed until the destination has received the message.
A process is only active after its vertex is placed on the queue.
It is possible for many processes to be inactive, leading to an
inefficient solution.
The method is impractical for a large graph if one vertex is allocated to
each processor. A group of vertices could be allocated to
each processor instead.
Example: Floyd algorithm
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
