
Principles of Parallel Algorithm
Design
Carl Tropper
Department of Computer Science
What has to be done
• Identify concurrency in program
• Map concurrent pieces to parallel
processes
• Distribute input, output and intermediate
data
• Manage accesses to shared data by
processors
• Synchronize processors as program
executes
Vocabulary
• Tasks
• Task Dependency graph
Matrix vector multiplication
Database Query
• Model = civic and year = 2001 and (color = green or color = white); see the sketch below
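As a rough illustration (not from the slides), the sketch below treats each predicate of the query as an independent task and combines the results the way the task dependency graph would; the records table and the thread pool are assumptions made for the example.

```python
# A minimal sketch: each predicate is a leaf task; the and/or combinations
# form the task dependency graph above them.
from concurrent.futures import ThreadPoolExecutor

records = [
    {"model": "civic",   "year": 2001, "color": "green"},
    {"model": "civic",   "year": 2001, "color": "white"},
    {"model": "corolla", "year": 2001, "color": "green"},
]  # hypothetical data

def select(pred):
    # Leaf task: evaluate one predicate over the whole table.
    return {i for i, r in enumerate(records) if pred(r)}

with ThreadPoolExecutor() as pool:
    # Leaf tasks are independent and can run concurrently.
    civic = pool.submit(select, lambda r: r["model"] == "civic")
    y2001 = pool.submit(select, lambda r: r["year"] == 2001)
    green = pool.submit(select, lambda r: r["color"] == "green")
    white = pool.submit(select, lambda r: r["color"] == "white")
    # Interior tasks depend on the leaves below them.
    gw = green.result() | white.result()            # color = green or color = white
    answer = civic.result() & y2001.result() & gw   # final and
print(sorted(answer))
```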
Data Dependencies
Another graph
Task talk
• Task Granularity
• Fine grained, coarse grained
• Degree of concurrency
• Average degree-average number of tasks which can run in
parallel
• Maximum degree
• Critical path
• Length = sum of the weights of the nodes on the path
• Average degree of concurrency = total work / critical path length (sketch below)
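A small sketch of these definitions on a hypothetical weighted task dependency graph (four independent tasks feeding one combining task); the weights and dependencies are made up for the example.

```python
# Hypothetical task dependency graph: node weights and predecessor lists.
weights = {"a": 10, "b": 10, "c": 10, "d": 10, "e": 6}
deps    = {"a": [], "b": [], "c": [], "d": [], "e": ["a", "b", "c", "d"]}

memo = {}
def path_length(node):
    # Heaviest chain of dependencies ending at `node`, counting the node itself.
    if node not in memo:
        memo[node] = weights[node] + max(
            (path_length(p) for p in deps[node]), default=0)
    return memo[node]

total_work      = sum(weights.values())                 # 46
critical_length = max(path_length(n) for n in weights)  # 10 + 6 = 16
print(total_work / critical_length)                     # average degree of concurrency = 2.875
```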
Task interaction graph
• Nodes are tasks
• Edges indicate interaction of tasks
• Task dependency graph is a subset of the task interaction graph
Sparse matrix-vector multiplication
• Tasks compute entries of output vector
• Task i owns row i and b(i)
• Task i sends the nonzero elements of row i to other tasks that need them (see the sketch below)
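A minimal shared-memory sketch of the row-wise tasks (the names A_rows and b are illustrative). Here task i simply reads the b entries matching its nonzeros; in the message-passing version described above, those values would first be sent between the tasks that own them.

```python
from concurrent.futures import ThreadPoolExecutor

# Sparse rows stored as {column: value} dictionaries.
A_rows = {0: {0: 2.0, 3: 1.0},
          1: {1: 4.0},
          2: {0: 1.0, 2: 3.0},
          3: {3: 5.0}}
b = [1.0, 2.0, 3.0, 4.0]

def row_task(i):
    # Task i owns row i and b[i]; it touches only the b entries
    # that match its nonzero columns.
    return sum(a_ij * b[j] for j, a_ij in A_rows[i].items())

with ThreadPoolExecutor() as pool:
    y = list(pool.map(row_task, sorted(A_rows)))
print(y)   # [6.0, 8.0, 10.0, 20.0]
```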
Sparse matrix task interaction graph
Process Mapping
Goals and illusions
• Goals
• Maximize concurrency by mapping independent
tasks to different processors
• Minimize completion time by having a process ready to run each critical-path task as soon as it becomes available
• Map processes which communicate a lot to same
processor
• Illusions
• Can’t do all of the above; they conflict
Task Decomposition
• Big idea
• First decompose for message passing
• Then decompose for the shared memory on each
node
• Decomposition Techniques
  • Recursive
  • Data
  • Exploratory
  • Speculative
Recursive Decomposition
• Good for problems which are amenable to
a divide and conquer strategy
• Quicksort: a natural fit (see the sketch below)
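A minimal sketch of recursive decomposition for quicksort, assuming threads stand in for tasks and using an arbitrary cutoff below which the recursion stays serial.

```python
import threading

CUTOFF = 8  # below this size, recurse serially instead of spawning a task

def pquicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[0]
    lo = [x for x in a[1:] if x < pivot]
    hi = [x for x in a[1:] if x >= pivot]
    if len(a) <= CUTOFF:
        return pquicksort(lo) + [pivot] + pquicksort(hi)
    result = {}
    def sub(name, part):
        result[name] = pquicksort(part)
    t = threading.Thread(target=sub, args=("lo", lo))  # spawn one sub-problem as a new task
    t.start()
    sub("hi", hi)          # sort the other half in the current task
    t.join()               # the dependency edge back to the parent
    return result["lo"] + [pivot] + result["hi"]

print(pquicksort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))
```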
Quicksort Task Dependency Graph
Sometimes we force the issue
We re-cast the problem into a divide and conquer paradigm
Data Decomposition
• Idea-partitioning of data leads to tasks
• Can partition
  • Output data
  • Input data
  • Intermediate data
  • Whatever…
Partitioning Output Data
Each element of the output is computed
independently as a function of the input
Other decompositions
Output data again
Frequency of itemsets
Partition Input Data
• Sometimes the more natural thing to do
• Sum of n numbers: there is only one output
  • Divide input into groups
  • One task per group
  • Get intermediate results
  • Create one task to combine intermediate results (sketch below)
Top: partition input
Bottom: partition input and output
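A sketch of the input-partition recipe above for summing n numbers, using Python's multiprocessing pool; the group count p = 4 is an arbitrary choice.

```python
from multiprocessing import Pool

def group_sum(group):
    # One task per input group: produce an intermediate result.
    return sum(group)

if __name__ == "__main__":
    data = list(range(1000))
    p = 4                                    # number of groups / tasks
    groups = [data[i::p] for i in range(p)]  # partition the input
    with Pool(p) as pool:
        partial = pool.map(group_sum, groups)   # intermediate results
    total = sum(partial)                        # one combining task
    print(total == sum(data))
```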
Partitioning of Intermediate Data
• Good for multi-stage algorithms
• May improve concurrency over a strictly
input or strictly output partition
Matrix Multiply Again
Concurrency Picture
• Max concurrency of 8 vs max concurrency of 4 for the output partition
• Price is the storage for D (sketch below)
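A sketch of where the 8 and the 4 come from, using 2x2 matrices whose "blocks" are single numbers: stage one computes the intermediate array D with eight independent multiplications, stage two sums over k with four independent additions; the extra storage for D is the price noted above.

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Stage 1: 8 independent tasks D[k][i][j] = A[i][k] * B[k][j]  (max concurrency 8).
D = [[[A[i][k] * B[k][j] for j in range(2)] for i in range(2)] for k in range(2)]

# Stage 2: 4 independent tasks C[i][j] = D[0][i][j] + D[1][i][j]  (max concurrency 4).
C = [[D[0][i][j] + D[1][i][j] for j in range(2)] for i in range(2)]

print(C)   # [[19, 22], [43, 50]]
```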
Exploratory Decomposition
• For search-space type problems
• Partition the search space into small parts
• Look for a solution in each part (sketch below)
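A sketch of exploratory decomposition under assumed names: the search space is an integer range split into p parts, each searched by its own process; the goal test is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def search(part):
    # Look for a solution inside one part of the search space.
    for x in part:
        if x * x == 337561:          # an arbitrary goal test (337561 = 581 * 581)
            return x
    return None

if __name__ == "__main__":
    space = range(1, 1_000_001)
    p = 4
    parts = [range(i, space.stop, p) for i in range(1, p + 1)]
    with ProcessPoolExecutor(p) as pool:
        futures = [pool.submit(search, part) for part in parts]
        for f in as_completed(futures):
            if f.result() is not None:      # whichever task finds it first wins
                print("found", f.result())
                break
```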
Search Space Problem
The 15-puzzle
Decomposition
Parallel vs serial: is it worth it?
It depends on where you find the answer
Speculative Decomposition
• Computation gambles at a branch point in the program
• Takes a path before it knows the result
• Win big or waste work (sketch below)
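A sketch of the gamble, with illustrative stand-in functions: the two branch bodies are started speculatively while the branch condition is still being computed, and the result on the untaken path is discarded.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_condition():
    time.sleep(0.1)          # stands in for the long computation at the branch point
    return True

def branch_c():
    return sum(range(1_000))

def branch_d():
    return max(range(1_000))

with ThreadPoolExecutor() as pool:
    cond = pool.submit(slow_condition)
    c = pool.submit(branch_c)     # speculative work
    d = pool.submit(branch_d)     # speculative work
    result = c.result() if cond.result() else d.result()  # discard the loser
print(result)
```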
Speculative Example
Parallel discrete event simulation
• Idea: Compute results at c,d,e before output from a is known
Hybrid
• Sometimes better to put two ideas
together
Hybrid
• Quicksort: recursion eventually yields O(n) tasks, but the early levels expose little concurrency
• First decompose, then recurse (a poem)
Mapping
Tasks and their interactions influence choice
of mapping scheme
Task Characteristics
• Task generation
• Static: know all tasks before the algorithm executes
  • Data decomposition leads to static generation
• Dynamic: tasks are generated at runtime
  • Recursive decomposition leads to dynamic generation
  • Quicksort
Task Characteristics
• Task sizes
• Uniform, non-uniform
• Knowledge of task sizes
– 15 puzzle: don’t know task sizes
– Matrix multiplication: do know task sizes
• Size of data associated with tasks
• Big data can cause big communication
Task interactions
• Tasks share data, synchronization
information, work
• Static vs dynamic
• Static: the task interaction graph and the timing of interactions are known before execution
– Parallel matrix multiply
• Dynamic
– 15 puzzle problem
More interactions
• Regular versus irregular
  • Interaction may have structure which can be used
  • Regular: image dithering
  • Irregular: sparse matrix-vector multiplication
  • Access pattern for b depends on the structure of A
Image dithering
Data sharing
• Read only: parallel matrix multiply
• Read-write
– 15 puzzle
• Heuristic search: estimate the number of moves to a solution from each state
• Use priority queue to store states to be expanded
• Priority queue contains shared data
Task interactions
• One way
– Read only
• Two way
– Producer consumer style
– Read-write (15 puzzle)
Mapping tasks to processes
Goal
Reduce overhead caused by parallel execution
So
• Reduce communication between processes
• Minimize task idling
– Need to balance the load
• But these goals can conflict
Balancing load is not always enough to avoid idling
Task dependencies get in the way
Processes 9-12 can’t proceed until 1-8 finish
MORAL: Include task dependency information in mapping
Mappings can be
• Static: distribute tasks before the algorithm executes
• Depends on task size, size of data, task interactions
• NP-complete for non-uniform tasks
• Dynamic: distribute tasks during algorithm execution
• Easier with shared memory
Static Mapping
• Data partitioning
• Results in task decomposition
• Arrays, graphs common ways to represent data
• Task partitioning
• Task dependency graph is static
• Know task sizes
Array Distribution
• Block distribution
• Each process gets contiguous entries
• Good if computation of array element requires
nearby elements
• Load imbalance if different blocks do different
amounts of work
• Block-cyclic and cyclic distributions are used to redress load imbalances (sketch below)
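A sketch of the three distributions as owner-computes formulas, assuming n is a multiple of p and an illustrative block size b; these are the standard textbook mappings, not code from the slides.

```python
n, p, b = 16, 4, 2   # array size, process count, block size (illustrative)

def block_owner(i):
    return i // (n // p)          # contiguous chunks of n/p elements

def cyclic_owner(i):
    return i % p                  # deal elements out one at a time

def block_cyclic_owner(i):
    return (i // b) % p           # deal out blocks of size b

for name, owner in [("block", block_owner),
                    ("cyclic", cyclic_owner),
                    ("block-cyclic", block_cyclic_owner)]:
    print(f"{name:13s}", [owner(i) for i in range(n)])
```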
Block distribution of matrix
Block decomposition of matrix
C = A × B
Block Decomposition
Higher dimension partitions
• More concurrency
• up to n² processes for a 2D mapping vs n processes for a 1D mapping
• Reduces amount of interaction between
processes
• 1D: each process computing rows of C requires all of B
• 2D: each process computing a block of C requires only part of B (sketch below)
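A back-of-the-envelope sketch of those two bullets (n and p are arbitrary): under a 1D row partition each process needs its rows of A plus all of B, while under a 2D block partition it needs only a block row of A and a block column of B.

```python
n, p = 1024, 16
one_d = (n * n) // p + n * n              # n/p rows of A plus all of B
two_d = 2 * (n * n) // int(p ** 0.5)      # a block row of A plus a block column of B
print(f"1D: {one_d} elements per process, 2D: {two_d} elements per process")
```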
Graph Partitioning
• Array algorithms good for dense matrices,
structured interaction patterns
• Many algorithms
  • operate on sparse data structures
  • interaction of data elements is irregular and data dependent
• Numerical simulations of physical phenomena
  • are important
  • have these characteristics
  • use a mesh; each point represents something physical
Lake Superior Mesh
Random distribution
Balance the Load
Equalize number of edges crossing partitions
Task Partitioning
• Map task dependency graph onto
processes
• Optimal mapping problem is NP-complete
• Different choices for mapping
Binary Tree task dependency graph
• Arises for recursive algorithms, e.g. computing the minimum of a list of numbers
• Map onto hypercube of processes
Naïve task mapping
Better mapping
C’s contain fewer elements
Hierarchical Mapping
• Load imbalance can occur when mapping purely by the task dependency graph (binary tree); quicksort benefits from a hierarchical mapping
Hierarchical Mapping
• Sparse matrix factorization
• High levels guided by task dependency
graph, called the elimination graph
• Low level tasks use data decomposition
because computation happens later
Dynamic Mapping
• Why? Dynamic task dependency graph
• Two flavors
– Centralized: tasks are kept in a central data structure or looked after by one process
– Distributed: Processes exchange tasks at run
time
Centralized
• Example: sort each row of an array by quicksort
• Problem: each row can take a different amount of time to sort
• Solution: self-scheduling; maintain a list of unsorted rows, and an idle process picks a row from the list
• Problem: the work queue becomes a bottleneck
• Solution: chunk scheduling; assign multiple tasks to a process at a time
• Problem: if the chunk size is too large, load imbalance returns
• Solution: decrease the size of the chunk as the computation proceeds (sketch below)
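A sketch of the row-sorting example with a centralized queue and a chunk size that shrinks as work runs out (a guided-self-scheduling style rule); row lengths, thread count, and the chunk formula are all illustrative choices.

```python
import threading, random

# Rows of different lengths, so sorting times differ (the load imbalance above).
rows = [sorted(random.sample(range(10_000), random.randint(10, 2_000)), reverse=True)
        for _ in range(64)]

next_row, remaining, p = 0, len(rows), 4
lock = threading.Lock()

def get_chunk():
    # Centralized scheduler: hand out roughly remaining/(2p) rows per request,
    # so chunks shrink as the computation proceeds.
    global next_row, remaining
    with lock:
        if next_row >= len(rows):
            return None
        size = max(1, remaining // (2 * p))
        chunk = range(next_row, min(next_row + size, len(rows)))
        next_row += len(chunk)
        remaining -= len(chunk)
        return chunk

def worker():
    while (chunk := get_chunk()) is not None:
        for i in chunk:
            rows[i] = sorted(rows[i])      # the task: sort one row

threads = [threading.Thread(target=worker) for _ in range(p)]
for t in threads: t.start()
for t in threads: t.join()
print(all(r == sorted(r) for r in rows))
```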
Distributed Schemes
• The Four Questions
  – How do I measure the load on a task?
  – To whom do I send?
  – How much do I send?
  – When do I send?
• Can tolerate smaller granularity on shared memory than on distributed memory machines
Tricks to reduce overhead of
process interaction
• Maximize data locality
• Minimize use of nonlocal data, minimize frequency
of access, maximize reuse of recently accessed
data
• Minimize volume of data exchange
• Use the mapping scheme, e.g. a 2-dimensional mapping vs a 1-dimensional mapping
• Use local data to store intermediate results, and access shared data once, e.g. break the dot product of two vectors into p partial sums (sketch below)
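A sketch of the dot-product bullet (thread count and vector sizes are arbitrary): each task accumulates its partial sum locally and touches the shared result exactly once.

```python
import threading

x = list(range(10_000))
y = list(range(10_000))
p = 4

result = 0
lock = threading.Lock()

def partial_dot(r):
    global result
    local = 0                         # local storage for the intermediate result
    for i in range(r, len(x), p):
        local += x[i] * y[i]
    with lock:                        # a single access to the shared result
        result += local

threads = [threading.Thread(target=partial_dot, args=(r,)) for r in range(p)]
for t in threads: t.start()
for t in threads: t.join()
print(result == sum(a * b for a, b in zip(x, y)))
```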
Minimize frequency of
interactions
• High startup cost associated with each interaction, so try to access large amounts of data per interaction
• Try for spatial locality: keep memory which is accessed consecutively close together
• Pack lots of data into a single message when message passing
• Reduce the number of cache lines fetched from shared memory
• Example: repeated sparse matrix-vector multiplication with the same matrix but different data; each process gets what it needs from other processes prior to the multiplication
Hot spots
• Hot spots happen: processes transmit over the same link, access the same data
• Sometimes we can re-arrange the computation to avoid the hot spot
Example C = A × B
• C_{i,j} = ∑_k A_{i,k} B_{k,j}
Overlapping Computations with
Interactions
• Try to do interaction before computation
(static interaction pattern helps)
• Multiple tasks on same process. If one
blocks, another can execute
• Need support from OS, hardware,
programming paradigm
• Disjoint memory, message passing architectures
• Shared address space-prefetching hardware
Tricks-Replicating Data and/or
computations
• Replicate in each process
• Frequent read only operations can make it
worthwhile
• Mostly for distributed memory machines; shared memory machines have caches
Optimize Heavy Duty Operations
• Operations
  • Access data
  • Communication-intensive computations
  • Synchronization
• Algorithms and libraries exist
  • Algorithms: discussed soon
  • Libraries: MPI
Tricks: overlapping interactions
Parallel Algorithm Models
or
Recipes for decomposing, mapping and minimizing
• Data parallel
  • Static mapping of tasks to processes
  • Each task does the same thing to different data
  • Phases: computation followed by synchronization
  • Message passing architecture more amenable to this style than shared memory architecture
Recipes
• Task Graph Model
• Used when amount of data is large relative to the
computation on the data
• Used with
– divide and conquer algorithms
– Parallel quicksort
– Sparse matrix factorization
Recipes
• Work pool model
– Any task can be executed by any process
– Dynamic mapping of tasks to processes
– Examples
• Parallelization of loops by chunk scheduling
• Parallel tree search
Recipes
• Master-slave model
• Dictator gives work to students
• Hierarchical master-slave model
• Pipeline
  • Stream of data passed through processes
  • Producers followed by consumers
  • General graph, not just a linear array
  • Example: Parallel LU factorization (later)
• Hybrid