Transcript: Load Balancing
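Lecture 8: Chapter 7, Load Balancing and Termination Detection

Load balancing: distributing the computation evenly across the processors to obtain the highest possible execution speed.
Termination detection: detecting when a computation has been completed. This is more difficult when the computation is distributed.

Load Balancing

Static Load Balancing
This is the scheduling problem. Static load balancing can be attempted before any process is executed.

Some potential static load balancing techniques:
• Round robin algorithm: passes out tasks in sequential order of processes, coming back to the first when all processes have been given a task
• Randomized algorithms: select processes at random to take tasks
• Recursive bisection: recursively divides the problem into subproblems of equal computational effort while minimizing message passing
• Simulated annealing: an optimization technique
• Genetic algorithm: another optimization technique

Several fundamental flaws of static load balancing:
• It is very difficult to estimate the execution times of the various parts of a program without actually executing them.
• Communication delays vary under different circumstances.
• Some problems have an indeterminate number of steps to reach their solution.

Dynamic Load Balancing
This is scheduling at run time: load balancing is performed while the processes execute. The factors listed above are taken into account by dividing the load according to how the parts are actually executing. This incurs additional overhead during execution, but it is more effective than static load balancing.

Processes and Processors
Computation will be divided into work or tasks to be performed, and processes perform these tasks. Processes are mapped onto processors. Since our objective is to keep the processors busy, we are interested in the activity of the processors.

Dynamic Load Balancing
Can be classified as:
• Centralized
• Decentralized

Centralized dynamic load balancing
Tasks are held together and handed out from a central location, in a master-slave structure.

Decentralized dynamic load balancing
Tasks are passed between arbitrary processes. A collection of worker processes operates on the problem; the workers interact among themselves and finally report to a single process. A worker process may receive tasks from other worker processes and may send tasks to other worker processes.

Centralized Dynamic Load Balancing
The master process holds the collection of tasks to be performed. Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
Terms used: work pool, replicated worker, processor farm.

Figure: centralized work pool.

Termination
The computation terminates when both of the following are satisfied:
• The task queue is empty, and
• Every slave process is idle and has made a request for another task, without any new tasks being generated.
An empty queue is not a sufficient condition for termination: if one or more processes are still running, a running process may generate new tasks for the queue.

Decentralized Dynamic Load Balancing
Figure: distributed work pool.

Fully Distributed Work Pool
Processes execute tasks from each other.

Task Transfer Mechanisms
Receiver-initiated method
A process requests tasks from other processes it selects. Typically, a process requests tasks from other processes when it has few or no tasks to perform. This method works well at high system load.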
Drawback: determining process loads can be expensive.

Sender-initiated method
A process sends tasks to other processes it selects. Typically, a process with a heavy load passes some of its tasks to other processes. This method works well at light overall system load.

Another option is to combine the two methods: lightly loaded processes request tasks from heavily loaded processes, and heavily loaded processes send tasks to lightly loaded processes. Drawback: determining process loads can be expensive.
Note: under very heavy system load, load balancing can also be difficult to achieve because of the lack of available processes.

Figure: decentralized selection algorithm requesting tasks between slaves (each slave runs a local selection algorithm).

Process Selection
Algorithms for selecting a process:
• Round robin algorithm: process Pi requests tasks from process Px, where x is given by a counter that is incremented after each request, using modulo n arithmetic (n processes), excluding x = i.
• Random polling algorithm: process Pi requests tasks from process Px, where x is a number that is selected randomly between 0 and n - 1 (excluding i).

Load Balancing Using a Line Structure
Figure: task queue.
The master process (P0) feeds the queue with tasks at one end, and tasks are shifted down the queue. When a process Pi (1 <= i < n) detects a task at its input from the queue and the process is idle, it takes the task from the queue. The tasks to the left then shuffle down the queue so that the space held by the task is filled. A new task is inserted into the left-side end of the queue. Eventually, all processes have a task and the queue is filled with new tasks. High-priority or larger tasks could be placed in the queue first.

Shifting Actions
Could be orchestrated by using messages between adjacent processes:
• For left and right communication
• For the current task

Code Using Time Sharing Between Communication and Computation
Master process (P0)
Process Pi (1 < i < n)
A nonblocking nrecv() is necessary to check for a request being received from the right.

Nonblocking Receive Routines
MPI
The nonblocking receive, MPI_Irecv(), returns a request "handle," which is used in subsequent completion routines to wait for the message or to establish whether the message has actually been received at that point (MPI_Wait() and MPI_Test(), respectively). In effect, the nonblocking receive, MPI_Irecv(), posts a request for a message and returns immediately.

Load Balancing Using a Tree Structure
Tasks are passed from a node into one of the two nodes below it when a node buffer is empty.
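The two process-selection rules described earlier in this section (round robin and random polling) can be sketched in C as follows. This is only an illustration under stated assumptions: the function names next_round_robin() and next_random() are hypothetical and do not come from the slides, processes are numbered 0 to n - 1, and the standard library rand() stands in for whatever random source a real program would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Round robin selection (hypothetical helper, not from the slides).
 * *counter holds the next candidate; it advances modulo n after every
 * request, and the caller's own rank (self) is skipped. */
int next_round_robin(int self, int n, int *counter)
{
    assert(n > 1 && self >= 0 && self < n);
    int x = *counter % n;
    if (x == self)              /* exclude x = i */
        x = (x + 1) % n;
    *counter = (x + 1) % n;     /* increment after each request */
    return x;
}

/* Random polling (hypothetical helper, not from the slides).
 * Picks one of the n - 1 other processes. rand() % (n - 1) has a slight
 * modulo bias, which is acceptable for a sketch. */
int next_random(int self, int n)
{
    assert(n > 1 && self >= 0 && self < n);
    int x = rand() % (n - 1);   /* 0 .. n-2 */
    if (x >= self)
        x++;                    /* shift past self, giving 0 .. n-1 without self */
    return x;
}
```

With n = 4 and self = 2, repeated calls to next_round_robin() starting from a counter of 0 select processes 0, 1, 3, 0, and so on, never selecting process 2.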
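Distributed Termination Detection Algorithms

Termination Conditions
At time t, distributed termination requires the following conditions to be satisfied:
• Application-specific local termination conditions exist throughout the collection of processes, at time t.
• There are no messages in transit between processes at time t.
The second condition is necessary because a message in transit might restart a terminated process. It is also the harder condition to recognize, because the time that messages take to travel between processes is not known in advance.

A General Distributed Termination Algorithm
Each process is in one of two states:
1. Inactive: without any task to perform
2. Active
The process that sent the task that made a process enter the active state becomes its "parent."

1. At the start, a process is in the inactive state.
2. On receiving its first task, the process becomes active, and the process that sent the task becomes its parent.
3. On receiving a second or later task, the process immediately sends an acknowledgment message. It sends an acknowledgment to its parent only when it is ready to become inactive.
A process may become inactive only when all of the following hold:
• Its local termination condition is satisfied (all tasks it was given are completed),
• It has sent acknowledgments for all the tasks it has received,
• It has received acknowledgments for all the tasks it has sent out.
A process must become inactive before its parent process. When the first (initiating) process becomes idle, the computation terminates.

Figure: termination using message acknowledgments.

Ring Termination Algorithms
Single-pass ring termination algorithm
1. When P0 has terminated, it generates a token that is passed to P1.
2. When Pi (1 <= i < n) receives the token and has already terminated, it passes the token onward to Pi+1. Otherwise, it waits for its local termination condition and then passes the token onward. Pn-1 passes the token to P0.
3. When P0 receives a token, it knows that all processes in the ring have terminated. A message can then be sent to all processes informing them of global termination, if necessary.

Figure: ring termination detection algorithm (token passing).

The algorithm assumes that a process cannot be reactivated after reaching its local termination condition. It therefore does not apply to work pool problems, in which a process can pass a new task to an idle process.

Dual-Pass Ring Termination Algorithm
The dual-pass ring algorithm can handle processes being reactivated.
Suppose the token has already passed Pj and has now reached Pi (j < i). Pi passes the token onward, but Pi also passes a task back to Pj, so Pj is reactivated. If this happens, the token must travel around the ring a second time.
To distinguish this situation, tokens are colored white or black, and processes are also colored white or black.
Receiving a black token means that global termination has not yet occurred, and the token must be recirculated around the ring.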
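The coloring scheme of the dual-pass ring algorithm can be illustrated with a small sequential simulation. This is a sketch under assumptions, not the course's code: the names color_t, on_send_task(), forward_token() and circulate() are hypothetical, the ring is simulated in a single program rather than with message passing, and every process is assumed to have reached its local termination condition by the time the token arrives. The rules encoded are the usual ones for this algorithm: a process that passes a task to an earlier process in the ring becomes black, a black process colors the token black and then becomes white, and a white process forwards the token unchanged.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WHITE, BLACK } color_t;

/* Process i sends a task to process j. If j comes before i in the ring,
 * the token may already have passed j, so process i turns black. */
void on_send_task(color_t proc[], int i, int j)
{
    if (j < i)
        proc[i] = BLACK;
}

/* Process i forwards the token. A black process colors the token black
 * and becomes white again; a white process leaves the color unchanged. */
color_t forward_token(color_t proc[], int i, color_t token)
{
    if (proc[i] == BLACK) {
        token = BLACK;
        proc[i] = WHITE;
    }
    return token;
}

/* One circulation started by P0 with a white token.
 * Returns true if P0 gets a white token back (global termination);
 * false means the token came back black and must go round again. */
bool circulate(color_t proc[], int n)
{
    assert(n > 0);
    color_t token = WHITE;
    for (int i = 0; i < n; i++)
        token = forward_token(proc, i, token);
    return token == WHITE;
}
```

In this simulation, if P3 passes a task back to P1 before a circulation, that circulation returns a black token to P0, and the following circulation, with no further backward sends, returns a white one.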
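Figure: passing a task to a previous process (labels: white token sent, white token, black token; a black process becomes a white process after passing the black token on).
A black process colors the token black; a white process passes the token on in its original color.
If P0 receives a black token, it sends out a white token; if it receives a white token, the algorithm stops.

Tree Termination Algorithm
Figure: token passing in a tree (token 1, token 2).
If there are m leaves, the root must receive m tokens. The root then broadcasts through the tree to all processes, and only then is global termination reached.

Fixed Energy Distributed Termination Algorithm
A fixed quantity of "energy" is used, which is equivalent to a token carrying a numeric value.
• The system starts with all the energy held by the master process.
• The master process passes out portions of the energy, together with tasks, to the processes that request tasks.
• If these processes receive requests for tasks, they divide the energy further and pass it on.
• When a process becomes idle, it returns the energy it holds before requesting a new task.
• A process hands back its energy only after all the energy it handed out has been returned and combined into the total energy it holds.
• When all the energy has been returned to the master process and the master process is idle, all processes must be idle and the computation terminates.
Significant disadvantage: because of finite numerical precision, the sum of the divided parts of the energy may not equal the original total.

Load Balancing/Termination Detection Example
Shortest Path Problem
Finding the shortest distance between two points on a graph. It can be stated as follows:
Given a set of interconnected nodes where the links between the nodes are marked with "weights," find the path from one specific node to another specific node that has the smallest accumulated weights.
The interconnected nodes can be described by a graph. The nodes are called vertices, and the links are called edges. If the edges have implied directions (that is, an edge can only be traversed in one direction), the graph is a directed graph.

Example: The Best Way to Climb a Mountain
Figure: graph of mountain climb.
Weights in the graph indicate the amount of effort that would be expended in traversing the route between two connected camp sites. The effort in one direction may be different from the effort in the opposite direction (downhill instead of uphill!), so the graph is a directed graph.

Graph Representation
Two basic ways that a graph can be represented in a program:
1. Adjacency matrix: a two-dimensional array, a, in which a[i][j] holds the weight associated with the edge between vertex i and vertex j if one exists
2. Adjacency list: for each vertex, a list of vertices directly connected to the vertex by an edge and the corresponding weights associated with the edges
An adjacency matrix is used for dense graphs. An adjacency list is used for sparse graphs. The difference is based upon space (storage) requirements.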
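Accessing the adjacency list is slower than accessing the adjacency matrix.

Figure: representing the graph.

Searching a Graph
Two well-known single-source shortest-path algorithms:
• Moore's single-source shortest-path algorithm (Moore, 1957)
• Dijkstra's single-source shortest-path algorithm (Dijkstra, 1959)
which are similar. Moore's algorithm is chosen because it is more amenable to parallel implementation, although it may do more work. The weights must all be positive values for the algorithm to work.

Moore's Algorithm
An algorithm in the style of dynamic programming. It is iterative and starts from the source vertex: when the current iteration considers vertex i, it finds the shortest distances from the source vertex through vertex i to the other vertices.
With d_j the current shortest distance from the source to vertex j, and w_{i,j} the weight of the edge from vertex i to vertex j, the shortest distance is updated as
    d_j = min(d_j, d_i + w_{i,j})

Figure: Moore's shortest-path algorithm (source vertex, vertex i, vertex j, destination vertex; distance d_j and edge weight w_{i,j}; initial queue and current queue).
In each iteration, every vertex j whose distance is replaced must be added to the queue. The destination vertex is never placed in the queue.

Data Structures
Because the search is directed, a first-in first-out vertex queue is set up to hold the vertices to be examined. Initially, only the source vertex is in the queue.
The current shortest distance from the source vertex to vertex i is stored in the array dist[i]. Initially these distances are set to infinity, except for the source vertex itself, whose distance is zero.

Code
Suppose w[i][j] holds the weight of the edge from vertex i to vertex j (infinity if no edge). The code could be of the form
    newdist_j = dist[i] + w[i][j];
    if (newdist_j < dist[j]) dist[j] = newdist_j;
When a shorter distance is found to vertex j, vertex j is added to the queue (if not already in the queue), which will cause vertex j to be examined again.

Stages in Searching a Graph
Example
The initial values of the two key data structures are: a vertex queue containing only the source vertex A, and dist[] with every entry set to infinity except the entry for A, which is 0.

Examine each edge leaving vertex A (A -> B). A is removed from the queue, the shortest distance from A to B is updated (AB = 10), and B is added to the queue.
Paths so far: AB

Examine the edges leaving B (B -> F, B -> E, B -> D, B -> C). B is removed from the queue, the shortest distances from A to F, E, D and C are updated, and E, D and C (but not the destination F) are added to the queue.
    AF = min(AF, AB + BF) = min(∞, 10 + 51) = 61
    AE = min(AE, AB + BE) = min(∞, 10 + 24) = 34
    AD = min(AD, AB + BD) = min(∞, 10 + 13) = 23
    AC = min(AC, AB + BC) = min(∞, 10 + 8) = 18
Paths so far: AB, ABC, ABD, ABE, ABF

Examine all the edges leaving E (E -> F). E is removed from the queue and the shortest distance from A to F is updated; the destination F is not added to the queue.
    AF = min(AF, AE + EF) = min(61, 34 + 17) = 51
Paths so far: AB, ABC, ABD, ABE, ABEF

Examine all the edges leaving D (D -> E). D is removed from the queue, the distance from A to E is updated, and E is added to the queue.
    AE = min(AE, AD + DE) = min(34, 23 + 9) = 32
Paths so far: AB, ABC, ABD, ABDE, ABEF

Examine all the edges leaving C (C -> D). C is removed from the queue, but AD is not improved, so D is not added to the queue.
    AD = min(AD, AC + CD) = min(23, 18 + 14) = 23

Examine the edges leaving E once more (E -> F). E is removed from the queue and AF is updated.
    AF = min(AF, AE + EF) = min(51, 32 + 17) = 49
Paths so far: AB, ABC, ABD, ABDE, ABDEF

No more vertices to consider.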
We now have the minimum distance from vertex A to each of the other vertices, including the destination vertex, F. Usually, the path is required in addition to the distance. The path is then stored as the distances are recorded. The path in our case is A -> B -> D -> E -> F.

Sequential Code
Let next_vertex() return the next vertex from the vertex queue, or no_vertex if none. Assume that an adjacency matrix is used, named w[][].

Parallel Implementations
Centralized Work Pool
The centralized work pool holds the vertex queue, vertex_queue[], as tasks. Each slave takes vertices from the vertex queue and returns new vertices. Since the structure holding the graph weights is fixed, this structure could be copied into each slave, say as a copied adjacency matrix.

Decentralized Work Pool
The task queue vertex_queue[] and the array dist[] can be distributed. Each process i is assigned one vertex i. The process stores the vertex queue entry for that vertex (the neighbors reached from that vertex) and the current shortest distance from the source vertex to vertex i.
Each process stores row w[i][] of the adjacency matrix to find its neighbors and the edge weights.

Search Algorithm
Process i, holding its current shortest distance dist, computes d_j = dist + w[i][j] for each of its edges from i to j and sends d_j to process j.
When process j receives d_j, it compares d_j with its own shortest distance dist. If d_j < dist, it updates dist = d_j, computes the distances d_k, d_l, d_h, ... for all the edges leaving j (to k, l, h, ...), and sends them to processes k, l, h, .... If d_j >= dist, it does not update.

Distributed graph search
A mechanism is necessary to repeat the actions and to terminate when all processes are idle. It must cope with messages in transit.
Simplest solution: use synchronous message passing, in which a process cannot proceed until the destination has received the message.
A process is only active after its vertex is placed on the queue. It is possible for many processes to be inactive, leading to an inefficient solution.
The method is impractical for a large graph if one vertex is allocated to each processor. A group of vertices could be allocated to each processor instead.

Example: Floyd algorithm

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
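The sequential form of Moore's algorithm described in this lecture (a first-in first-out vertex queue plus the update dist[j] = min(dist[j], dist[i] + w[i][j])) can be sketched in C as follows. This is a minimal sketch under assumptions rather than the listing from the slides: the function name moore(), the constants MAXV and NO_EDGE, the row-major layout of w, and the pred[] array used to record the path are all introduced here for illustration. Unlike the slides, the sketch does not single out a destination vertex, so it computes the distance to every vertex.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAXV 64          /* assumed upper bound on the number of vertices */
#define NO_EDGE INT_MAX  /* stands in for "infinity" in w and dist */

/* Moore's single-source shortest paths with a FIFO vertex queue.
 * w is an n-by-n adjacency matrix stored row by row: w[i * n + j] is the
 * weight of the edge from i to j, or NO_EDGE. On return, dist[v] is the
 * shortest distance from source to v (NO_EDGE if unreachable) and pred[v]
 * is the previous vertex on that path (-1 if none). Weights are assumed
 * positive and small enough that sums do not overflow. */
void moore(int n, const int *w, int source, int *dist, int *pred)
{
    assert(n > 0 && n <= MAXV && source >= 0 && source < n);
    int queue[MAXV];               /* circular queue; no duplicates, so n slots suffice */
    bool queued[MAXV] = { false };
    int head = 0, count = 0;

    for (int v = 0; v < n; v++) {
        dist[v] = NO_EDGE;
        pred[v] = -1;
    }
    dist[source] = 0;
    queue[0] = source;
    count = 1;
    queued[source] = true;

    while (count > 0) {
        int i = queue[head];       /* next vertex to examine */
        head = (head + 1) % MAXV;
        count--;
        queued[i] = false;
        for (int j = 0; j < n; j++) {
            int wij = w[i * n + j];
            if (wij == NO_EDGE)
                continue;
            int newdist_j = dist[i] + wij;
            if (newdist_j < dist[j]) {
                dist[j] = newdist_j;
                pred[j] = i;       /* record the path as distances are recorded */
                if (!queued[j]) {  /* add j only if not already in the queue */
                    queue[(head + count) % MAXV] = j;
                    count++;
                    queued[j] = true;
                }
            }
        }
    }
}
```

Run on the worked example with vertices A to F numbered 0 to 5 and the edge weights that appear in the trace (AB = 10, BC = 8, BD = 13, BE = 24, BF = 51, CD = 14, DE = 9, EF = 17), the sketch gives dist[F] = 49 and the predecessor chain F, E, D, B, A. The order in which vertices enter the queue differs from the slides because edges are scanned by vertex index, but the final distances are the same.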