Principles of Parallel Algorithm Design - TR-Grid

Principles of
Parallel Algorithm Design
Prof. Dr. Cevdet Aykanat
Bilkent Üniversitesi
Bilgisayar Mühendisliği Bölümü
Principles of Parallel Algorithm Design
•Identifying concurrent tasks
•Mapping tasks onto multiple processes
•Distributing input, output, intermediate data
•Managing access to shared data
•Synchronizing processors
• several choices for each step
• relatively few combinations lead to a good parallel algorithm
• different choices yield best performance on
– different parallel architectures
– different parallel programming paradigms
Decomposition, Tasks
• decomposition:
– dividing a computation into smaller parts
– some or all parts can be executed concurrently
• atomic task
– user defined
– indivisible units of computation
– same size or different size
Task Dependence Graphs (TDG)
• directed acyclic graph
• nodes : atomic tasks
• directed edges : dependencies
– some tasks use data produced by other tasks
• TDG can be weighted:
– node wgt: amount of computation
– edge wgt: amount of data
• multiple ways of expressing certain computations
– different ways of arranging computations
– lead to different TDGs
Granularity, Concurrency
• granularity: number (#) and size of tasks
– fine grain : large # of small tasks
– coarse grain : small # of large tasks
• degree of concurrency (DoC):
– # of tasks that can be executed simultaneously
• max DoC : max DoC at any given time
– tree TDGs: max DoC = # of leaves (usually)
• avg DoC : average DoC over the entire duration of execution
Degree of Concurrency
• depends on granularity
– finer task granularity : larger DoC
– bound on fine granularity of a decomposition
• depends on shape of TDG
– shallow and wide TDG : larger DoC
– deep and thin TDG : smaller DoC
– critical path:
• longest directed path between a start node and a finish node
– critical path length = sum of wgts along the path
– avg DoC = total work / critical path length
Task Interaction Graph (TIG)
• tasks share input, output or intermediate data
• interactions among independent tasks of a TDG
• TIG: pattern of interactions among tasks
– node: task
– edge: connects tasks that interact with each other
• TIG can be weighted:
– node wgt: amount of computation
– edge wgt: amount of interaction
Processes and Mapping
• process vs processor:
– process: logical computing agent that performs tasks
– processor: hardware unit that physically performs the computations
• mapping: assigning tasks to processes
• conflicting goals in a good mapping
– maximize concurrency
• map independent tasks to different processes
– minimize idle time / interaction overhead
• map tasks along critical path to same process
• map tasks with high interaction to same processes
• extreme case: mapping all tasks to the same process removes all interaction but also all concurrency
Decomposition Techniques
recursive decomposition
data decomposition
exploratory decomposition
speculative decomposition
Recursive Decomposition
divide-and-conquer (DAC) strategy → natural concurrency
divide: split the problem into a set of independent subproblems
conquer: recursively solve each subproblem
combine: merge the solutions of the subproblems into a solution of the problem
if sequential algorithm is not based on DAC
– restructure computation as a DAC algorithm
– recursive decomposition to extract concurrency
– e.g., finding minimum of an array A of n numbers
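A minimal sketch of the array-minimum example restructured as a divide-and-conquer computation. The function name and the half-open index convention are assumptions made for illustration. The two recursive calls are independent tasks, so they could run concurrently; here they run one after the other.

```python
def recursive_min(a, lo, hi):
    """Minimum of a[lo:hi] (hi exclusive, hi > lo) by divide and conquer."""
    if hi - lo == 1:
        return a[lo]                      # atomic task: a single element
    mid = (lo + hi) // 2
    left = recursive_min(a, lo, mid)      # independent subproblem
    right = recursive_min(a, mid, hi)     # independent subproblem
    return min(left, right)               # combine step

A = [4, 9, 1, 7, 8, 11, 2, 12]
print(recursive_min(A, 0, len(A)))  # 1
```

The resulting TDG is a binary tree whose critical path has about log2(n) combine steps, compared with n - 1 dependent comparisons in the usual sequential loop.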
Data Decomposition
• partition/decompose computational data domain
• use this partition to induce task decomposition
– tasks: similar operations on different data parts
• partitioning output data
each output can be computed independently as a fn of input
example: block matrix multiplication
data decomposition may not lead to a unique task decomposition
another example: computing itemset frequencies
• input: transactions & output: itemset frequencies
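The block matrix multiplication example can be sketched as follows, assuming square n x n matrices stored as lists of lists and a q x q partition of the output with n divisible by q. All names are illustrative. Each call to `matmul_block` is one task and computes its output block independently of the others.

```python
def matmul_block(A, B, rows, cols):
    """One task: compute the output block C[rows, cols] as a function of the input."""
    n = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in cols] for i in rows]

def block_matmul(A, B, q=2):
    """C = A * B with the output partitioned into q x q blocks (assumes q divides n)."""
    n = len(A)
    b = n // q
    C = [[0] * n for _ in range(n)]
    for bi in range(q):
        for bj in range(q):
            rows = range(bi * b, (bi + 1) * b)
            cols = range(bj * b, (bj + 1) * b)
            block = matmul_block(A, B, rows, cols)   # independent task
            for r, i in enumerate(rows):
                for c, j in enumerate(cols):
                    C[i][j] = block[r][c]
    return C

print(block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The same output partition admits other task decompositions, for example splitting each block product into separate multiply and add tasks, which is why the data decomposition does not fix the task decomposition uniquely.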
Data Decomposition
• partitioning input data
– may not be possible or desirable to partition output data
• e.g., finding min, sum of a set of numbers, sorting
a task created for each part of the input data
task: all computations that can be done using local data
a combine step may be needed to combine results of tasks
example: finding the sum of an array A of n numbers
example: computing itemset frequencies
• partitioning both output and input data
– output data partitioning is feasible
– partitioning of input data offers additional concurrency
– example: computing itemset frequencies
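A sketch of input partitioning for the itemset-frequency example, under assumed data structures: transactions are Python sets, itemsets are frozensets, and the input is split into p parts. Each task counts all itemsets over its own transactions only, and a combine step adds the partial counts.

```python
from collections import Counter

def count_local(transactions, itemsets):
    """One task: frequency of every itemset within its own part of the input."""
    return Counter({s: sum(1 for t in transactions if s <= t) for s in itemsets})

def itemset_frequencies(transactions, itemsets, p):
    parts = [transactions[k::p] for k in range(p)]              # p-way input partition
    partial = [count_local(part, itemsets) for part in parts]   # independent tasks
    total = Counter()
    for counts in partial:                                      # combine step
        total.update(counts)   # update() adds counts and keeps zero entries
    return total

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "D"}, {"A", "B"}]
itemsets = [frozenset("AB"), frozenset("C"), frozenset("AD")]
freq = itemset_frequencies(transactions, itemsets, p=2)
print(freq[frozenset("AB")], freq[frozenset("C")], freq[frozenset("AD")])  # 2 2 0
```

Partitioning the output instead would give each task a subset of `itemsets` and the full transaction list; partitioning both combines the two and still needs the combine step.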
Data Decomposition
• partitioning intermediate data
– multistage computations
• partitioning input or output data of an intermediate stage
– may lead to higher concurrency
– some restructuring of the algorithm may be needed
– example: block matrix multiplication
• owner computes rule
– each part performs all computations involving data it owns
– input: perform all computations that can be done using local data
– output: compute all data in the partition
Other Decomposition Techniques
• exploratory decomposition
search of a configuration space for a solution
partition the search space into smaller parts
search each part concurrently
total parallel work can be less than, equal to, or greater than total serial work
example: 15-puzzle problem
• speculative decomposition
• hybrid decompositions
computation structured into multiple stages
may apply different decompositions in different stages
examples: finding min of an array and quicksort
data decomposition then recursive decomposition
Characteristics of Tasks
• task generation: static vs dynamic task generation
– static: all tasks are known prior to execution of the algorithm
• data decomposition: matrix multiplication
• recursive decomposition: finding min of an array
– dynamic: actual tasks and TDG/TIG not available a priori
• rules, guidelines governing task generation may be known
• recursive decomposition: quicksort
• another example: ray tracing
• task sizes: uniform vs non-uniform
– complexity of mapping depends on this
– tasks in matrix multiplication: uniform
– tasks in quicksort: non-uniform
Characteristics of Tasks
• knowledge of task sizes
– can be used in mapping
– known: tasks in decompositions for matrix multiplication
– unknown: tasks in 15-puzzle problem
• not known a priori how many moves will lead to a solution
• size of data associated with tasks
– associated data must be available to the process
– size and location of the associated data
– consider data migration overhead in the mapping
Characteristics of Inter-Task Interactions
• static vs dynamic
– static: pattern and timing of interactions known a priori
– static interaction: decompositions for matrix multiplication
– message-passing paradigm (MPP):
active involvement of both interacting tasks
static interactions easy to program
dynamic interactions harder to program
tasks assigned additional synchronization and polling responsibilities
– shared-address-space (SASP): can handle both equally easily
• regular vs irregular (spatial structure)
– regular: structure that can be exploited for efficient implementation
• structured/curvilinear grids (implicit connectivity)
• image dithering (example)
– irregular: no such regular pattern exists
• unstructured grids (connectivity maintained explicitly)
• SpMxV (sparse matrix vector multiplication)
– irregular and dynamic interactions harder to handle in MPP
Characteristics of Inter-Task Interactions
• read-only vs read-write
read-only: tasks require read-only access to shared data
example: decompositions for matrix multiplication
read-write: tasks need to read and write on shared data
example: heuristic search for 15-puzzle problem
• one-way vs two-way
2-way: data/work needed by a task is explicitly supplied by another
usually involves a predefined producer and consumer
1-way: only one of a pair of communicating tasks initiates & completes the interaction
read-only → 1-way & read-write → either 1-way or 2-way
SASP can handle both interactions equally easily
MPP cannot handle 1-way interaction directly
• source of data should explicitly send it to the recipient
• static 1-way: easily converted to 2-way via program restructuring
• dynamic 1-way: nontrivial program restructuring needed for converting to 2-way
– polling: task checks for pending requests from others at regular intervals
Mapping Techniques
• minimize overheads of parallel task execution
– overhead: inter-process interaction
– overhead: process idle time (uneven load distribution)
• load balancing
– balanced aggregate load: necessary but not sufficient
– computations & interactions well balanced at each stage
– example: 12-task decomposition (tasks 9-12 depend on tasks 1-8)
Static vs Dynamic Mapping
• static: distribute tasks prior to execution
static task generation: either static or dynamic mapping
good mapping: knowledge of task sizes, data sizes, TIG
non-trivial problem (usually NP-hard)
task sizes known but non-uniform
• even if no TDG/TIG → number partitioning problem
• dynamic: distribute workload during execution
– dynamic task generation: dynamic mapping
– task sizes unknown: dynamic mapping more effective
– large data size: dynamic mapping costly (in MPP)
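For the static case with known but non-uniform task sizes and no TDG/TIG, the mapping reduces to number partitioning. The slides do not name a heuristic; the sketch below uses the common greedy "largest task first onto the least loaded process" rule as one possible choice. Task names and sizes are hypothetical.

```python
import heapq

def greedy_static_mapping(task_sizes, p):
    """Assign tasks in decreasing size order, each to the currently least loaded process."""
    heap = [(0, proc) for proc in range(p)]          # (current load, process id)
    assignment = {}
    for task, size in sorted(task_sizes.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)             # least loaded process
        assignment[task] = proc
        heapq.heappush(heap, (load + size, proc))
    loads = [0] * p
    for task, proc in assignment.items():
        loads[proc] += task_sizes[task]
    return assignment, loads

sizes = {"t1": 7, "t2": 5, "t3": 4, "t4": 3, "t5": 3, "t6": 2}
assignment, loads = greedy_static_mapping(sizes, p=2)
print(loads)  # [12, 12]
```

This is a heuristic, not an exact solver: it happens to balance this instance perfectly, but in general it only approximates the optimal partition.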
Static-Mapping Schemes
• mapping based on data partitioning
– data partitioning induces a decomposition
– partitioning selected with final mapping in mind
• i.e., p-way data decomposition
– dense arrays
– sparse data structures, graphs (FE meshes)
• mapping based on task partitioning
– task dependence graphs, task interaction graphs
• hierarchical partitioning
– hybrid decomposition and mapping techniques
Array Distribution Schemes
• block distributions: spatial locality of interaction
– each process receives a contiguous block of entries
– 1D: each part contains a block of consecutive rows
• i.e., kth part contains rows kn/p ... (k+1)n/p-1
– 2D: checkerboard partitioning
– higher dimensional distributions
• higher degree of concurrency
• less inter-process interaction
• example: matrix multiplication
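The 1D and 2D block distributions can be written as small index functions. This is a sketch with assumed names, using 0-based part numbers; the 2D version assumes an n x n array on a q x q process grid with q dividing n.

```python
def block_rows(k, n, p):
    """Rows owned by part k in a 1D block distribution of n rows over p parts.

    When p divides n this is exactly rows k*n/p ... (k+1)*n/p - 1.
    """
    return range(k * n // p, (k + 1) * n // p)

def checkerboard_owner(i, j, n, q):
    """Grid coordinates of the process owning entry (i, j) in a 2D block distribution."""
    b = n // q   # block side length, assumes q divides n
    return (i // b, j // b)

print(list(block_rows(1, 8, 4)))       # [2, 3]
print(checkerboard_owner(5, 2, 8, 4))  # (2, 1)
```

With integer division the 1D formula also covers the case where p does not divide n: the parts stay contiguous and their sizes differ by at most one row.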
Array Distribution Schemes
• cyclic distribution
– amount of work differs for different matrix entries
• examples: ray casting, dense LU factorization
• block distribution leads to load imbalance
– all processes have tasks from all parts of the matrix
– good load balance, but complete loss of locality
• block-cyclic distribution
– partition array into more than p blocks
– map blocks to processes in a round-robin (scattered) manner
• randomized block distribution
– when the distribution of work has some special pattern
• adaptive 2D array partitionings
– rectilinear, jagged, orthogonal bisection
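The difference between block, cyclic and block-cyclic distributions shows up once the work per row is non-uniform. The sketch below uses a hypothetical work profile in which row i costs i + 1 units, and compares the per-process loads; function names are assumptions.

```python
def cyclic_owner(i, p):
    """Cyclic distribution: row i goes to process i mod p."""
    return i % p

def block_cyclic_owner(i, block_size, p):
    """Block-cyclic: blocks of block_size rows are dealt out round-robin."""
    return (i // block_size) % p

def loads(owner, n, p, work):
    """Total work per process for n rows under the given owner function."""
    out = [0] * p
    for i in range(n):
        out[owner(i)] += work(i)
    return out

n, p = 8, 2
work = lambda i: i + 1                                     # hypothetical work profile
print(loads(lambda i: i // (n // p), n, p, work))          # block:        [10, 26]
print(loads(lambda i: cyclic_owner(i, p), n, p, work))     # cyclic:       [16, 20]
print(loads(lambda i: block_cyclic_owner(i, 2, p), n, p, work))  # block-cyclic: [14, 22]
```

Block-cyclic sits between the two extremes: larger blocks keep more locality, smaller blocks balance the load better.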
Dynamic Mapping Schemes
• centralized schemes
all tasks maintained in a common pool or by a process
idle processes take task(s) from central pool or master process
easier to implement
limited scalability: central pool/process becomes a bottleneck
chunk scheduling: idle processes get group of tasks
• danger of load imbalance due to large chunk sizes
• decrease chunk size as program progresses
– e.g., sorting entries in each row of a matrix
• non-uniform tasks & unknown task sizes
– e.g., image-space parallel ray casting
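One way to decrease the chunk size as the program progresses is to give each requesting process a fixed fraction of the remaining tasks. The sketch below assumes the rule "ceil(remaining / p) tasks per request", which is one common choice and not necessarily the one intended in the slides.

```python
import math

def decreasing_chunks(num_tasks, p, min_chunk=1):
    """Sizes of the chunks handed out by a central pool, in request order."""
    remaining = num_tasks
    chunks = []
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / p))
        size = min(size, remaining)       # never hand out more than is left
        chunks.append(size)
        remaining -= size
    return chunks

chunks = decreasing_chunks(100, 4)
print(chunks[0], chunks[-1], sum(chunks))  # 25 1 100
```

Large early chunks keep the number of requests to the central pool low, while small late chunks limit the imbalance a single slow chunk can cause at the end.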
Dynamic Mapping Schemes
• distributed schemes
– tasks are distributed among processes
– more scalable (no bottleneck)
– critical parameters of distributed load balancing
how are sending and receiving processes paired?
who initiates the work transfer: sender or receiver?
how much work is transferred in each exchange?
when is the work transfer performed?
• suitability to parallel architectures
– both can be implemented in both SAS and MP paradigms
– dynamic schemes require movement of tasks
– computational granularity of tasks should be high in MP systems
Methods for Interaction Overheads
• factors:
– volume and frequency of interaction
– spatial and temporal pattern of interactions
• maximizing data locality
– minimize volume of data exchange
• minimize overall volume of shared data
• similar to maximizing temporal data locality
– minimize frequency of interaction
• high startup cost associated with each interaction
• restructure algorithm: shared data accessed in large pieces
• similar to increasing spatial locality of data access
• minimizing contention and hot spots
– multiple tasks try to access same resource concurrently
• multiple simultaneous access to same memory block/bank
• multiple processes sending messages to same process at the same time
– e.g., matrix multiplication based on 2D partitioning
• overlapping computations with interactions
– early initiation of an interaction
– support from programming paradigm, OS, hardware
– MP: non-blocking message-passing primitives
Methods for Interaction Overheads
• replicating data or computation
– replicating frequently accessed read-only shared data
– MP paradigm benefits more from data replication
– replicated computation for shared intermediate results
• using optimized collective interaction operations
– usually use available implementations (e.g., by MPI)
– sometimes, it may be better to write your own procedure
• overlapping interactions with other interactions
– example: one-to-all broadcast
Parallel Algorithm Models
• data-parallel model
– data parallelism: identical operations applied
concurrently on different data items
• task graph model
– task parallelism: independent tasks in a TDG
– quicksort, sparse matrix factorization
• work-pool or task-pool model
– dynamic mapping of tasks onto processes
– mapping may be centralized or distributed
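A minimal sketch of the centralized work-pool model using Python threads and a shared queue. It assumes static task generation (all tasks are in the pool before the workers start) and hashable, distinct task identifiers; `run_work_pool` and its parameters are illustrative names.

```python
import queue
import threading

def run_work_pool(tasks, num_workers, work):
    """Idle workers repeatedly take the next task from a central pool."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = pool.get_nowait()     # dynamic mapping: whoever is idle takes it
            except queue.Empty:
                return                    # pool exhausted, no new tasks are generated
            r = work(t)
            with lock:                    # shared result table needs protection
                results[t] = r

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

squares = run_work_pool(range(10), 3, lambda x: x * x)
print(squares[7])  # 49
```

Which worker executes which task is decided at run time and may differ between runs; only the final results are deterministic.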
Parallel Algorithm Models
• master-slave or manager-worker model
– master process generates work & allocates to worker processes
• pipeline or producer-consumer model
– stream parallelism: execution of different programs on a data stream
– each process in the pipeline:
• consumer of the sequence of data items for the preceding process
• producer of data for the process following in the pipeline
– pipeline may not be a linear chain (it can be a DAG)
• hybrid models
– multiple models applied hierarchically
– multiple models applied sequentially to different stages
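The pipeline (producer-consumer) model described above can be sketched as a linear chain of threads connected by queues. Each stage consumes items from the stage before it and produces items for the stage after it. The sentinel object and all names are assumptions made for this illustration.

```python
import queue
import threading

_DONE = object()   # sentinel marking the end of the data stream

def stage(fn, inbox, outbox):
    """One pipeline process: consume from inbox, apply fn, produce to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)             # pass the end marker downstream
            return
        outbox.put(fn(item))

def run_pipeline(items, fns):
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(fns)]
    for th in threads:
        th.start()
    for item in items:                    # feed the data stream into the first stage
        queues[0].put(item)
    queues[0].put(_DONE)
    out = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        out.append(item)
    for th in threads:
        th.join()
    return out

print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))  # [20, 30, 40]
```

Because each stage is a single thread reading from a FIFO queue, the output order matches the input order; a DAG-shaped pipeline would need one queue per edge instead of one per stage boundary.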