
An Introduction To
PARALLEL PROGRAMMING
Ing. Andrea Marongiu
([email protected])
The Multicore Revolution is Here!

More instruction-level parallelism hard to find
  Very complex designs needed for small gain
  Thread-level parallelism appears live and well
Clock frequency scaling is slowing drastically
  Too much power and heat when pushing envelope
Cannot communicate across chip fast enough
  Better to design small local units with short paths
Effective use of billions of transistors
  Easier to reuse a basic unit many times
Potential for very easy scaling
  Just keep adding processors/cores for higher (peak) performance
Vocabulary in the Multi Era


AMP, Asymmetric MP: each processor has local memory; tasks are statically allocated to one processor.
SMP, Shared-Memory MP: processors share memory; tasks are dynamically scheduled to any processor.
Vocabulary in the Multi Era


Heterogeneous: specialization among processors, often with different instruction sets. Usually an AMP design.
Homogeneous: all processors have the same instruction set and can run any task. Usually an SMP design.
Future Embedded Systems
The First Software Crisis


60’s and 70’s:
PROBLEM: Assembly Language Programming
  Need to get abstraction and portability without losing performance
SOLUTION: High-level Languages (Fortran and C)
  Provided a “common machine language” for uniprocessors
The Second Software Crisis


80’s and 90’s:
PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  Need for composability, malleability and maintainability
SOLUTION: Object-Oriented Programming (C++ and Java)
  Better tools and software engineering methodology (design patterns, specification, testing)
The Third Software Crisis


Today:
PROBLEM: Solid boundary between hardware and software
  High-level languages abstract away the hardware
  Sequential performance is left behind by Moore’s Law
SOLUTION: What’s under the hood?
  Language features for architectural awareness
The Software becomes the Problem, AGAIN


Parallelism required to gain performance
  Parallel hardware is “easy” to design
  Parallel software is (very) hard to write
Fundamentally hard to grasp true concurrency
  Especially in complex software environments
Existing software assumes single-processor
  Might break in new and interesting ways
  Multitasking is no guarantee of running on a multiprocessor
Parallel Programming Principles





Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality
Coverage

Use more, but less powerful (and less power-hungry), cores to achieve the same performance?
Coverage


Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Speedup = old running time / new running time
Example: 100 seconds / 60 seconds = 1.67
Amdahl’s Law


Speedup = 1 / ((1 - p) + p/n)
  p = fraction of work that can be parallelized
  n = the number of processors
Implications of Amdahl’s Law


Speedup tends to 1/(1 - p) as the number of processors tends to infinity
Parallel programming is worthwhile when programs have a lot of work that is parallel in nature (a small numeric sketch follows below)
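
To make the limit concrete, here is a small C sketch (my addition, not part of the original slides) that tabulates Amdahl speedups for a few values of p and n; it shows the speedup flattening out near 1/(1 - p) even for very large core counts.

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n) */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double fractions[] = { 0.5, 0.9, 0.99 };   /* parallel fraction p */
    int    cores[]     = { 2, 8, 64, 1024 };   /* number of processors n */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("p = %.2f, n = %4d -> speedup = %7.2f\n",
                   fractions[i], cores[j],
                   amdahl_speedup(fractions[i], cores[j]));
    return 0;
}

For p = 0.9, for instance, the speedup never exceeds 10 no matter how many cores are added.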
Overhead
Overhead of Parallelism


Given enough parallel work, this is the biggest barrier to getting the desired speedup
Parallelism overheads include:
  cost of starting a thread or process
  cost of communicating shared data
  cost of synchronizing
  extra (redundant) computation
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
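
To give a feel for the first of these costs, here is a minimal POSIX-threads sketch (my addition, not from the slides; compile with -pthread) that times nothing but the creation and joining of threads that do no useful work.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 100

/* A thread body that does nothing, so only the startup overhead is measured. */
static void *empty(void *arg) { return arg; }

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, empty, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("creating and joining %d empty threads took %.1f us\n", NTHREADS, us);
    return 0;
}

If the useful work handed to each thread is not substantially larger than this startup cost, parallelization does not pay off.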
Parallel Programming Principles





Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality
Communication/Synchronization



Only a few programs are “embarrassingly” parallel
Programs have sequential parts and parallel parts
Need to orchestrate parallel execution among processors
  Synchronize threads to make sure dependencies in the program are preserved
  Communicate results among threads to ensure a consistent view of the data being processed
Communication/Synchronization

Shared memory
  Communication is implicit: one copy of data shared among many threads
  Atomicity, locking and synchronization are essential for correctness
  Synchronization is typically in the form of a global barrier
Distributed memory
  Communication is explicit through messages
  Cores access local memory
  Data distribution and communication orchestration are essential for performance
  Synchronization is implicit in messages
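
As a concrete shared-memory illustration (my own sketch, not from the slides; compile with -pthread), the increment below is protected by a mutex because counter++ is not atomic; without the lock, concurrent updates can be lost and the final value would be wrong.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* without the lock, updates to the   */
        counter++;                    /* shared counter can be lost because */
        pthread_mutex_unlock(&lock);  /* counter++ is not atomic            */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, increment, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}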
Parallel Programming Principles





Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality
Granularity


Granularity is a qualitative measure of the ratio of
computation to communication
Computation stages are typically separated from
periods of communication by synchronization events
Granularity

Fine-grain Parallelism
  Low computation to communication ratio
  Small amounts of computational work between communication stages
  Less opportunity for performance enhancement
  High communication overhead
Coarse-grain Parallelism
  High computation to communication ratio
  Large amounts of computational work between communication events
  More opportunity for performance increase
  Harder to load balance efficiently
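
A hedged sketch of the trade-off using OpenMP in C (my choice for brevity; the slides do not prescribe any particular API; compile with -fopenmp): the first loop hands out iterations one at a time, which is fine-grained and pays scheduling overhead for every tiny piece of work, while the second hands out a few large chunks, which is coarse-grained and cheap to schedule but harder to balance if the work per chunk were uneven.

#include <stdio.h>

#define N 1000000

static double a[N];

int main(void)
{
    /* Fine-grain: one iteration per scheduling decision -> high overhead
       relative to the tiny amount of work in each iteration. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* Coarse-grain: four large chunks -> negligible scheduling overhead,
       but uneven chunks would be harder to load balance. */
    #pragma omp parallel for schedule(static, N / 4)
    for (int i = 0; i < N; i++)
        a[i] += 1.0;

    printf("%f\n", a[N - 1]);   /* keep the result from being optimized away */
    return 0;
}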
Parallel Programming Principles





Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality
The Load Balancing Problem

Processors that finish early have to wait for the processor with
the largest amount of work to complete


Leads to idle time, lowers utilization
Particularly urgent with barrier synchronization
[Figure: unbalanced vs. balanced workloads]
The slowest core dictates the overall execution time
Static Load Balancing


The programmer makes decisions and assigns a fixed amount of work to each processing core a priori
Works well for homogeneous multicores
  All cores are the same
  Each core has an equal amount of work
Not so well for heterogeneous multicores
  Some cores may be faster than others
  Work distribution is uneven
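
A minimal static-partitioning sketch with POSIX threads (my addition, not from the slides; compile with -pthread): the iteration space is split a priori into equal contiguous chunks, one per core, which is exactly the scheme that works well only when cores and work items are uniform.

#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];

struct range { int begin, end; };

static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->begin; i < r->end; i++)
        data[i] = i * 0.5;          /* this core's fixed share of the work */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    /* A priori assignment: equal-sized contiguous chunks, one per core. */
    for (int t = 0; t < NTHREADS; t++) {
        r[t].begin = t * chunk;
        r[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("%f\n", data[N - 1]);
    return 0;
}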
Dynamic Load Balancing



The workload is partitioned into small tasks. Available tasks are pushed into a work queue
When a core finishes its allocated task, it takes further work from the queue. The process continues until all tasks have been assigned to some core for processing
Ideal for codes where work is uneven, and for heterogeneous multicores
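
A sketch of the work-queue idea using a shared atomic counter as the queue of task indices (my addition, not from the slides; C11 atomics plus POSIX threads, compile with -pthread): each core repeatedly grabs the next unprocessed task until the queue is exhausted, so faster cores simply end up processing more tasks.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   10000
#define NTHREADS 4

static atomic_int next_task = 0;   /* the "work queue": index of next task */
static double results[NTASKS];

static void do_task(int t) { results[t] = t * 0.5; }  /* placeholder work */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);  /* grab the next task */
        if (t >= NTASKS)
            break;                                /* queue exhausted */
        do_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%f\n", results[NTASKS - 1]);
    return 0;
}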
Parallel Programming Principles





Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality
Memory Access Latency

Uniform Memory Access (UMA) – Shared memory
  Centrally located shared memory
  All processors are equidistant (access times)
Non-Uniform Memory Access (NUMA)
  Shared memory – processors have the same address space → data is directly accessible by all, but the cost depends on the distance
    Placement of data affects performance
  Distributed memory – processors have private address spaces → data access is local, but the cost of messages depends on the distance
    Communication must be efficiently architected
Locality of Memory Accesses
(UMA Shared Memory)

Parallel computation is serialized due to memory contention and lack of bandwidth
Locality of Memory Accesses
(UMA Shared Memory)

Distribute data to relieve contention and increase effective bandwidth
Locality of Memory Accesses
(NUMA Shared Memory)
Once parallel tasks have been assigned to different processors..

int main()
{
    /* Task 1 */
    for (i = 0; i < n; i++)
        A[i][rand()] = foo();

    /* Task 2 */
    for (j = 0; j < n; j++)
        B[j] = goo();
}

[Figure: four CPUs, each with a local scratchpad memory (SPM), connected through an interconnect to a shared memory]
Locality of Memory Accesses
(NUMA Shared Memory)
..physical placement of data can have a great impact on performance!

(Same code and system as above; the figure shows where arrays A and B are placed.)
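
The slides use an abstract system with per-core scratchpad memories; as a hedged, Linux-specific sketch of the same placement idea (my assumption, not from the slides), libnuma can pin each array to a chosen NUMA node so that it is local to the cores that mostly touch it. Link with -lnuma; the arrays are simplified to 1D.

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t n = 1u << 20;
    int last = numa_max_node();                 /* highest NUMA node id */

    /* Place A on node 0 and B on the last node, so each array can be
       local to the cores that mostly work on it (cf. Task 1 / Task 2). */
    double *A = numa_alloc_onnode(n * sizeof *A, 0);
    double *B = numa_alloc_onnode(n * sizeof *B, last);
    if (!A || !B)
        return 1;

    for (size_t i = 0; i < n; i++) A[i] = 1.0;  /* Task 1's data */
    for (size_t j = 0; j < n; j++) B[j] = 2.0;  /* Task 2's data */

    printf("A placed on node 0, B placed on node %d\n", last);
    numa_free(A, n * sizeof *A);
    numa_free(B, n * sizeof *B);
    return 0;
}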
Locality in Communication
(Message Passing)