Principles of Parallel Algorithm Design
Carl Tropper, Department of Computer Science

What has to be done
• Identify concurrency in the program
• Map concurrent pieces to parallel processes
• Distribute input, output and intermediate data
• Manage accesses to shared data by processors
• Synchronize processors as the program executes

Vocabulary
• Tasks
• Task dependency graph

Matrix-vector multiplication

Database Query
• MODEL = CIVIC AND YEAR = 2001 AND (COLOR = GREEN OR COLOR = WHITE)

Data Dependencies

Another graph

Task talk
• Task granularity
  – Fine grained, coarse grained
• Degree of concurrency
  – Average degree: average number of tasks which can run in parallel
  – Maximum degree
• Critical path
  – Length: sum of the weights of the nodes on the path
• Average degree of concurrency = total work / critical path length

Task interaction graph
• Nodes are tasks
• Edges indicate interaction of tasks
• The task dependency graph is a subset of the task interaction graph

Sparse matrix-vector multiplication
• Tasks compute entries of the output vector
• Task i owns row i and b(i)
• Task i sends the non-zero elements of row i to the other tasks which need them

Sparse matrix task interaction graph

Process Mapping

Goals and illusions
• Goals
  – Maximize concurrency by mapping independent tasks to different processors
  – Minimize completion time by having a process ready when a task on the critical path is ready
  – Map processes which communicate a lot to the same processor
• Illusions
  – Can't do all of the above: they conflict

Task Decomposition
• Big idea
  – First decompose for message passing
  – Then decompose for the shared memory on each node
• Decomposition techniques
  – Recursive
  – Data
  – Exploratory
  – Speculative

Recursive Decomposition
• Good for problems which are amenable to a divide-and-conquer strategy
• Quicksort: a natural fit

Quicksort Task Dependency Graph

Sometimes we force the issue
• We re-cast the problem into the divide-and-conquer paradigm

Data Decomposition
• Idea: partitioning the data leads to tasks
• Can partition
  – Output data
  – Input data
  – Intermediate data
  – Whatever…

Partitioning Output Data
• Each element of the output is computed independently as a function of the input

Other decompositions

Output data again
• Frequency of itemsets

Partition Input Data
• Sometimes the more natural thing to do
• Sum of n numbers: there is only one output
  – Divide the input into groups
  – One task per group
  – Get intermediate results
  – Create one task to combine the intermediate results

Top: partition input
Bottom: partition input and output

Partitioning of Intermediate Data
• Good for multi-stage algorithms
• May improve concurrency over a strictly input or strictly output partition

Matrix Multiply Again

Concurrency Picture
• Max concurrency of 8 vs. max concurrency of 4 for the output partition
• The price is the storage for D

Exploratory Decomposition
• For search-space problems
• Partition the search space into small parts
• Look for a solution in each part

Search Space Problem
• The 15 puzzle

Decomposition

Parallel vs serial: is it worth it?
• It depends on where you find the answer

Speculative Decomposition
• The computation gambles at a branch point in the program
• Takes a path before it knows the result
• Win big or waste the work

Speculative Example
• Parallel discrete event simulation
• Idea: compute results at c, d, e before the output from a is known

Hybrid
• Sometimes it is better to put two ideas together

Hybrid
• Quicksort: recursion results in O(n) tasks, but little concurrency
• First decompose, then recurse (a poem); see the sketch below
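The hybrid quicksort idea can be made concrete. The following is a minimal sketch, not the course's code: it assumes OpenMP tasks rather than MPI to keep it short, and the CUTOFF grain size and function names are invented for illustration. Each recursive call spawns a task until subproblems fall below the cutoff, after which sorting stays serial, i.e. decompose first, recurse in parallel only while it pays.

```c
/* Sketch of recursive/hybrid decomposition for quicksort with OpenMP
 * tasks.  CUTOFF and the function names are illustrative only.        */
#include <omp.h>

#define CUTOFF 1000   /* grain size below which recursion stays serial */

static int partition(int *a, int lo, int hi) {
    int pivot = a[hi], i = lo - 1;
    for (int j = lo; j < hi; j++)
        if (a[j] <= pivot) { int t = a[++i]; a[i] = a[j]; a[j] = t; }
    int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
    return i + 1;
}

static void quicksort_task(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    if (hi - lo < CUTOFF) {            /* coarse grained: stay serial  */
        quicksort_task(a, lo, p - 1);
        quicksort_task(a, p + 1, hi);
    } else {                           /* fine grained: spawn tasks    */
        #pragma omp task shared(a)
        quicksort_task(a, lo, p - 1);
        #pragma omp task shared(a)
        quicksort_task(a, p + 1, hi);
        #pragma omp taskwait
    }
}

void parallel_quicksort(int *a, int n) {
    #pragma omp parallel
    #pragma omp single            /* one thread seeds the task tree */
    quicksort_task(a, 0, n - 1);
}
```

The cutoff is what makes this a hybrid decomposition: near the root the recursion generates concurrency, and below the cutoff the per-task work stays large enough that task-creation overhead does not dominate.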
Mapping
• Tasks and their interactions influence the choice of mapping scheme

Task Characteristics
• Task generation
  – Static: know all tasks before the algorithm executes
    • Data decomposition leads to static generation
  – Dynamic: tasks are generated at runtime
    • Recursive decomposition leads to dynamic generation
    • Quicksort

Task Characteristics
• Task sizes
  – Uniform, non-uniform
• Knowledge of task sizes
  – 15 puzzle: don't know task sizes
  – Matrix multiplication: do know task sizes
• Size of data associated with tasks
  – Big data can cause big communication

Task interactions
• Tasks share data, synchronization information, work
• Static vs dynamic
  – Static: know the task interaction graph and when interactions happen before execution
    • Parallel matrix multiply
  – Dynamic
    • 15 puzzle problem

More interactions
• Regular versus irregular
  – An interaction may have structure which can be used
  – Regular: image dithering
  – Irregular: sparse matrix-vector multiplication
    • Access pattern for b depends on the structure of A

Image dithering

Data sharing
• Read only: parallel matrix multiply
• Read-write: 15 puzzle
  – Heuristic search: estimate the number of moves to a solution from each state
  – Use a priority queue to store the states to be expanded
  – The priority queue contains shared data

Task interactions
• One way
  – Read only
• Two way
  – Producer-consumer style
  – Read-write (15 puzzle)

Mapping tasks to processes
• Goal: reduce the overhead caused by parallel execution
• So
  – Reduce communication between processes
  – Minimize task idling
    • Need to balance the load
  – But these goals can conflict

Balancing the load is not always enough to avoid idling
• Task dependencies get in the way
• Processes 9-12 can't proceed until 1-8 finish
• MORAL: include task dependency information in the mapping

Mappings can be
• Static: distribute tasks before the algorithm executes
  – Depends on task size, size of data, task interactions
  – NP-complete for non-uniform tasks
• Dynamic: distribute tasks during algorithm execution
  – Easier with shared memory

Static Mapping
• Data partitioning
  – Results in task decomposition
  – Arrays and graphs are common ways to represent data
• Task partitioning
  – Task dependency graph is static
  – Know task sizes

Array Distribution
• Block distribution
  – Each process gets contiguous entries
  – Good if computation of an array element requires nearby elements
  – Load imbalance if different blocks do different amounts of work
• Block-cyclic and cyclic distributions are used to redress load imbalances

Block distribution of matrix

Block decomposition of matrix C = A x B

Block Decomposition

Higher dimension partitions
• More concurrency
  – Up to n² processes for a 2D mapping vs n processes for a 1D mapping
• Reduces the amount of interaction between processes
  – 1D: C requires all of B for each product
  – 2D: C requires part of B

Graph Partitioning
• Array algorithms are good for dense matrices and structured interaction patterns
• Many algorithms operate on sparse data structures
  – Interaction of data elements is irregular and data dependent
• Numerical simulations of physical phenomena
  – Are important
  – Have these characteristics
  – Use a mesh: each point represents something physical

Lake Superior Mesh

Random distribution

Balance the Load
• Equalize the number of edges crossing partitions

Task Partitioning
• Map the task dependency graph onto processes
• The optimal mapping problem is NP-complete
• Different choices for the mapping

Binary Tree task dependency graph
• Happens for recursive algorithms, e.g. computing the minimum of a list of numbers
• Map onto a hypercube of processes (see the sketch below)

Naïve task mapping

Better mapping: the C's contain fewer elements
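To illustrate the binary-tree/hypercube mapping, here is a small MPI sketch. It is an assumption for illustration only (the slides give no code, and the stand-in local values are invented): at step d, each process whose d-th bit is set sends its partial minimum to its hypercube neighbour and drops out, mirroring the levels of the task dependency tree. It is cleanest with a power-of-two number of processes.

```c
/* Sketch: minimum of distributed values via recursive halving on a
 * hypercube of MPI processes.  local_min is a stand-in for real data. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_min = 100 + rank;            /* placeholder local value */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {                 /* this task's level is done: send */
            MPI_Send(&local_min, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD);
            break;
        } else if ((rank ^ mask) < size) { /* receive from hypercube neighbour */
            int other;
            MPI_Recv(&other, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (other < local_min) local_min = other;
        }
    }
    if (rank == 0) printf("global minimum = %d\n", local_min);
    MPI_Finalize();
    return 0;
}
```

Each level of the tree halves the number of active processes, which is exactly why a naive linear mapping of the same tree leaves most processes idle.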
Hierarchical Mapping
• Load imbalance can occur when mapping purely by the task dependency graph (binary tree)
• Quicksort benefits

Hierarchical Mapping
• Sparse matrix factorization
• High levels are guided by the task dependency graph, called the elimination graph
• Low-level tasks use data decomposition because the computation happens later

Dynamic Mapping
• Why? The task dependency graph is dynamic
• Two flavors
  – Centralized: tasks are kept in a central data structure or looked after by one process
  – Distributed: processes exchange tasks at run time

Centralized
• Example: sort each row of an array by quicksort
• Problem: each row can take a different amount of time to sort
• Solution: self scheduling - maintain a list of unsorted rows; an idle process picks from the list
• Problem: the work queue becomes a bottleneck
• Solution: chunk scheduling - assign multiple tasks to a process
• Problem: the chunk size is too large - load imbalance
• Solution: decrease the size of the chunk as the computation proceeds

Distributed Schemes
• The Four Questions
  – How do I measure the load on a task?
  – To whom do I send?
  – How much do I send?
  – When do I send?
• Can tolerate smaller granularity on shared memory than on distributed memory machines

Tricks to reduce the overhead of process interaction
• Maximize data locality
  – Minimize use of nonlocal data, minimize frequency of access, maximize reuse of recently accessed data
• Minimize volume of data exchange
  – Use the mapping scheme, e.g. 2-dimensional mapping vs 1-dimensional mapping
  – Use local data to store intermediate results, and access shared data once, e.g. break the dot product of two vectors into p partial sums

Minimize frequency of interactions
• High startup cost associated with each interaction, so try to access large amounts of data at once
• Try for spatial locality: keep memory which is accessed consecutively close together
• Pack lots of data into a message in message passing
• Reduce the number of cache lines fetched from shared memory
• Example: repeated sparse matrix-vector multiplication - same matrix, but different data. Each process gets what it needs from the other processes prior to the multiplication.

Hot spots
• Hot spots happen: processes transmit over the same link, access the same data
• Sometimes the computation can be re-arranged to avoid the hot spot

Example C = A x B
• Ci,j = ∑k=0..n-1 Ai,k Bk,j

Overlapping Computations with Interactions
• Try to do the interaction before the computation (a static interaction pattern helps)
• Multiple tasks on the same process: if one blocks, another can execute
• Need support from the OS, hardware, programming paradigm
  – Disjoint memory, message passing architectures
  – Shared address space: prefetching hardware
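To make the overlap concrete, here is a minimal sketch assuming MPI nonblocking calls; the buffer names, the single-neighbour exchange, and the stand-in local work are illustrative, not from the slides. The point is to post the interaction early, compute on data that does not touch the communication buffers, and block only when the remote data is actually needed.

```c
/* Sketch: overlap an exchange with a neighbour process using
 * nonblocking MPI calls.  sendbuf/recvbuf/interior are placeholders. */
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf,
                          double *interior, int n, int m,
                          int neighbor, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* Start the interaction before the computation ... */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ... and overlap it with work that touches neither buffer. */
    for (int i = 0; i < m; i++)
        interior[i] = 0.5 * interior[i];   /* stand-in local computation */

    /* Block only when the exchanged data is required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* recvbuf is now valid for the rest of the computation. */
}
```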
Tricks: Replicating Data and/or Computations
• Replicate in each process
• Frequent read-only operations can make it worthwhile
• Mostly for distributed memory machines: shared memory machines have caches

Optimize Heavy Duty Operations
• Operations
  – Access data
  – Communication-intensive computations
  – Synchronization
• Algorithms and libraries exist
  – Algorithms: discuss them soon
  – Libraries: MPI

Tricks: overlapping interactions

Parallel Algorithm Models, or Recipes for decomposing, mapping and minimizing
• Data parallel
  – Static mapping of tasks to processes
  – Each task does the same thing to different data
  – Phases: computation followed by synchronization
  – Message passing architecture is more amenable to this style than shared memory architecture

Recipes
• Task graph model
  – Used when the amount of data is large relative to the computation on the data
  – Used with
    • Divide and conquer algorithms
    • Parallel quicksort
    • Sparse matrix factorization

Recipes
• Work pool model
  – Any task can be executed by any process
  – Dynamic mapping of tasks to processes
  – Examples
    • Parallelization of loops by chunk scheduling
    • Parallel tree search

Recipes
• Master-slave model
  – The dictator gives work to the students (see the sketch below)
  – Hierarchical master-slave model
• Pipeline
  – A stream of data is passed through the processes
  – Producers followed by consumers
  – General graph, not just a linear array
  – Example: parallel LU factorization (later)
• Hybrid
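As a concrete illustration of the master-slave / work pool recipe, here is a minimal MPI sketch; it is assumed for illustration (the task count, tags, and the empty do_task() are placeholders, not from the slides). Rank 0 hands out task indices on demand, which is the same self-scheduling idea as in the Centralized dynamic mapping slide.

```c
/* Sketch: master-slave work pool in MPI.  Rank 0 is the master and
 * hands out one task index per request; workers loop until stopped. */
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static void do_task(int task) { (void)task; /* placeholder for real work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                        /* master */
        int next = 0, stopped = 0;
        MPI_Status st;
        while (stopped < size - 1) {
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);  /* a worker asks for work */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                                /* workers self-schedule */
        int task, ask = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&ask, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            do_task(task);
        }
    }
    MPI_Finalize();
    return 0;
}
```

A chunked variant would send a range of indices per request instead of a single one, trading pressure on the master for possible load imbalance, as described under chunk scheduling above.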