Shared Memory Programming
Synchronization primitives
Ing. Andrea Marongiu
([email protected])
Includes slides from course CS162 at UC Berkeley, by Prof. Anthony D. Joseph and Prof. Ion Stoica,
and from course CS194, by Prof. Katherine Yelick
Shared Memory Programming
• Program is a collection of threads of control.
  • Threads can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
  • Also a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables (see the pthreads sketch after the figure below)
Shared memory
[Figure: shared-memory model. Processors P0, P1, …, Pn each keep a private variable i (i: 2, i: 5, i: 8) in their own private memory, and all of them read and write a shared variable s (s = …; y = … s …) in the shared memory.]
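As a minimal illustration of this model (a sketch, not from the original slides; all names are made up): in C with POSIX threads, a file-scope variable plays the role of the shared s, while each thread's stack variables are private.

/* Sketch: shared vs. private variables with pthreads.                        */
/* The global s is shared by all threads; i and my_val live on each thread's  */
/* stack and are therefore private to that thread.                            */
#include <pthread.h>
#include <stdio.h>

static int s = 0;                      /* shared: visible to every thread     */

static void *worker(void *arg) {
    int i = *(int *)arg;               /* private: each thread has its own i  */
    int my_val = i * 10;               /* private temporary                   */
    s = s + my_val;                    /* unsynchronized write: a data race!  */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {2, 5};
    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, worker, &ids[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);
    printf("s = %d\n", s);             /* result is not deterministic         */
    return 0;
}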
Shared Memory code for computing a sum
static int s = 0;

Thread 1                              Thread 2
for i = 0, n/2-1                      for i = n/2, n-1
    s = s + sqr(A[i])                     s = s + sqr(A[i])

• The problem is a race condition on the variable s in the program (see the sketch after this list)
• A race condition or data race occurs when:
- two processors (or two threads) access the same variable, and at least one does a write
- the accesses are concurrent (not synchronized), so they could happen simultaneously
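A compilable version of this racy sum (a sketch assuming pthreads and a fixed N; none of the identifiers below come from the slides) makes the nondeterminism easy to reproduce:

/* Sketch: the racy sum from the slide, written with pthreads.                */
/* Both threads update the shared s without synchronization (a data race),    */
/* so the final value of s can vary from run to run.                          */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static int s = 0;                      /* shared accumulator                  */

static int sqr(int x) { return x * x; }

static void *half_sum(void *arg) {
    long first = (long)arg;            /* 0 or N/2: start of this half        */
    for (long i = first; i < first + N / 2; i++)
        s = s + sqr(A[i]);             /* read-modify-write of s: not atomic  */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, half_sum, (void *)0);
    pthread_create(&t2, NULL, half_sum, (void *)(long)(N / 2));
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %d (expected %d)\n", s, N);   /* often less than N            */
    return 0;
}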
Shared Memory code for computing a sum
static int s = 0;

Thread 1                                  Thread 2
…                                         …
compute f(A[i]) and put in reg0           compute f(A[i]) and put in reg0
reg1 = s                                  reg1 = s
reg1 = reg1 + reg0                        reg1 = reg1 + reg0
s = reg1                                  s = reg1
…                                         …

(With A = [3, 5] and f = square, one possible interleaving gives Thread 1: reg0 = 9, reg1 = 0, reg1 = 9, s = 9 and Thread 2: reg0 = 25, reg1 = 0, reg1 = 25, s = 25; one of the two updates is lost.)

• Assume A = [3,5], f is the square function, and s = 0 initially
• For this program to work, s should be 34 at the end
• but it may be 34, 9, or 25
• The atomic operations are reads and writes
• Never see ½ of one number, but the += operation is not atomic
• All computations happen in (private) registers
Shared Memory code for computing a sum
static int s = 0;

Thread 1                                  Thread 2
local_s1 = 0                              local_s2 = 0
for i = 0, n/2-1                          for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])           local_s2 = local_s2 + sqr(A[i])
s = s + local_s1     ← must be ATOMIC     s = s + local_s2     ← must be ATOMIC

• Since addition is associative, it’s OK to rearrange order
• Right?
• Most computation is on private variables
- Sharing frequency is also reduced, which might improve speed
- But there is still a race condition on the update of shared s
Atomic Operations
• To understand a concurrent program, we need to know what the
underlying indivisible operations are!
• Atomic Operation: an operation that always runs to completion
or not at all
• It is indivisible: it cannot be stopped in the middle and state cannot be
modified by someone else in the middle
• Fundamental building block – without atomic operations, threads have no way to work together
• On most machines, memory references and assignments (i.e., loads and stores) of words are atomic (a C11 example of a richer atomic operation follows)
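For instance (a sketch using C11 stdatomic.h, not part of the original slides), an atomic fetch-and-add performs the whole read-modify-write indivisibly, unlike the plain += in the sum example:

/* Sketch: an indivisible read-modify-write with C11 atomics.                 */
/* atomic_fetch_add reads, adds and writes back as one atomic operation,      */
/* so concurrent updates are never lost (contrast with the plain s += x).     */
#include <stdatomic.h>

static atomic_int s = 0;               /* shared counter                      */

void add_squared(int x) {
    atomic_fetch_add(&s, x * x);       /* atomic: no interleaving can split it */
}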
Role of Synchronization
• “A parallel computer is a collection of processing elements
that cooperate and communicate to solve large problems
fast.”
• Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
    • point-to-point
    • group
    • global (barriers)
These are the most used forms of synchronization in shared memory parallel programming.
• How much hardware support?
Motivation: “Too much milk”
• Example: People need to coordinate:
Time     Person A                          Person B
3:00     Look in Fridge. Out of milk
3:05     Leave for store
3:10     Arrive at store                   Look in Fridge. Out of milk
3:15     Buy milk                          Leave for store
3:20     Arrive home, put milk away        Arrive at store
3:25                                       Buy milk
3:30                                       Arrive home, put milk away
Definitions
• Synchronization: using atomic operations to ensure
cooperation between threads
• For now, only loads and stores are atomic
• hard to build anything useful with only reads and writes
• Mutual Exclusion: ensuring that only one thread does a
particular thing at a time
• One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can
execute at once
• Critical section and mutual exclusion are two ways of describing the
same thing
• Critical section defines sharing granularity
More Definitions
• Lock: prevents someone from doing something
• Lock before entering critical section and
before accessing shared data
• Unlock when leaving, after accessing shared data
• Wait if locked
• Important idea: all synchronization involves waiting
• Example: fix the milk problem by putting a lock on refrigerator
• Lock it and take key if you are going to go buy milk
• Fixes too much (coarse granularity): roommate angry if only wants
orange juice
Too Much Milk: Correctness properties
• Need to be careful about correctness of concurrent
programs, since non-deterministic
• Always write down desired behavior first
• think first, then code
• What are the correctness properties for the “Too much
milk” problem?
• Never more than one person buys
• Someone buys if needed
• Restrict ourselves to use only atomic load and store
operations as building blocks
Too Much Milk: Solution #1
• Use a note to avoid buying too much milk:
• Leave a note before buying (kind of “lock”)
• Remove note after buying (kind of “unlock”)
• Don’t buy if note (wait)
• Suppose a computer tries this (remember, only memory
read/write are atomic):
if (noMilk) {
    if (noNote) {
        leave Note;
        buy milk;
        remove note;
    }
}
• Result?
Too Much Milk: Solution #1
Thread A                                  Thread B
if (noMilk) {
    if (noNote) {
                                          if (noMilk) {
                                              if (noNote) {
                                                  leave Note;
                                                  buy milk;
                                                  remove note;
                                              }
                                          }
        leave Note;
        buy milk;
        remove note;
    }
}

Both threads can see “no note” before either leaves one, so both buy milk.
Need to atomically update the lock variable (the note).
How to Implement Lock?
• Lock: prevents someone from accessing something
• Lock before entering critical section (e.g., before accessing shared data)
• Unlock when leaving, after accessing shared data
• Wait if locked
• Important idea: all synchronization involves waiting
• Should sleep if waiting for a long time
• Hardware atomic instructions?
Examples of hardware atomic instructions
• test&set (&address) {                   /* most architectures */
      result = M[address];
      M[address] = 1;
      return result;
  }

• swap (&address, register) {             /* x86 */
      temp = M[address];
      M[address] = register;
      register = temp;
  }

• compare&swap (&address, reg1, reg2) {   /* 68000 */
      if (reg1 == M[address]) {
          M[address] = reg2;
          return success;
      } else {
          return failure;
      }
  }

All three are executed as atomic operations!
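In portable C, these primitives are exposed, for example, through C11 stdatomic.h (a sketch, not from the slides; the variable names are made up):

/* Sketch: C11 equivalents of the hardware atomic instructions above.         */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int word = 0;

void examples(void) {
    /* test&set-like: atomically store 1 and get the old value back           */
    int old = atomic_exchange(&word, 1);

    /* swap: atomically exchange the memory word with a register value        */
    int reg = 42;
    reg = atomic_exchange(&word, reg);

    /* compare&swap: store 7 only if the word still holds `expected`          */
    int expected = old;
    bool success = atomic_compare_exchange_strong(&word, &expected, 7);
    (void)reg; (void)success;
}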
Implementing Locks with test&set
• Simple solution:
int value = 0;                 // Free

Acquire() {
    while (test&set(value));   // while busy
}

Release() {
    value = 0;
}

// Reminder:
test&set (&address) {
    result = M[address];
    M[address] = 1;
    return result;
}
• Simple explanation:
• If lock is free, test&set reads 0 and sets value=1, so lock is now busy. It returns
0 so while exits
• If lock is busy, test&set reads 1 and sets value=1 (no change). It returns 1, so
while loop continues
• When we set value = 0, someone else can get the lock (a C11 sketch of this spinlock follows)
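A working version of this lock (a sketch using C11 atomic_flag, which behaves like test&set; not the slides' own code):

/* Sketch: a test&set spinlock built on C11 atomic_flag.                      */
/* atomic_flag_test_and_set atomically sets the flag and returns its old      */
/* value, exactly like the test&set instruction on the previous slide.        */
#include <stdatomic.h>

static atomic_flag lock_value = ATOMIC_FLAG_INIT;    /* clear = free          */

void acquire(void) {
    while (atomic_flag_test_and_set(&lock_value))    /* returns 1 while busy  */
        ;                                            /* spin (busy-wait)      */
}

void release(void) {
    atomic_flag_clear(&lock_value);                  /* back to free          */
}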
Too Much Milk: Solution #2
• Lock.Acquire() – wait until lock is free, then grab
• Lock.Release() – unlock, waking up anyone waiting
• atomic operations – if two threads are waiting for the lock, only one succeeds
to grab the lock
• Then, our milk problem is easy:
milklock.Acquire();
if (nomilk)
    buy milk;
milklock.Release();
• Once again, section of code between Acquire() and Release()
called a “Critical Section”
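With POSIX threads, the same pattern would look roughly like this (a sketch with made-up names milk_lock and milk; not from the slides):

/* Sketch: the milk critical section protected by a pthreads mutex.           */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t milk_lock = PTHREAD_MUTEX_INITIALIZER;
static bool milk = false;

void buy_milk_if_needed(void) {
    pthread_mutex_lock(&milk_lock);    /* Acquire(): wait until lock is free  */
    if (!milk)                         /* critical section: check and buy     */
        milk = true;                   /*   at most one thread buys           */
    pthread_mutex_unlock(&milk_lock);  /* Release(): let a waiter in          */
}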
Shared Memory code for computing a sum
static int s = 0;
static lock lk;

Thread 1                                  Thread 2
local_s1 = 0                              local_s2 = 0
for i = 0, n/2-1                          for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])           local_s2 = local_s2 + sqr(A[i])
lock(lk);                                 lock(lk);
s = s + local_s1                          s = s + local_s2
unlock(lk);                               unlock(lk);
• Since addition is associative, it’s OK to rearrange order
• Right?
• Most computation is on private variables
- Sharing frequency is also reduced, which might improve speed
- The race condition on the update of shared s is now removed by the lock: only one thread at a time can hold lk, so the two updates cannot interleave (a complete pthreads version is sketched below)
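Putting it together in compilable form (a sketch assuming pthreads and a fixed N; identifiers other than s, lk and local_s are invented):

/* Sketch: the locked sum from the slide, written with pthreads.              */
/* Each thread accumulates into a private local sum; only the final update    */
/* of the shared s is protected by the mutex.                                  */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static int s = 0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static int sqr(int x) { return x * x; }

static void *half_sum(void *arg) {
    long first = (long)arg;
    int local_s = 0;                   /* private partial sum                 */
    for (long i = first; i < first + N / 2; i++)
        local_s = local_s + sqr(A[i]);
    pthread_mutex_lock(&lk);           /* critical section protects shared s  */
    s = s + local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, half_sum, (void *)0);
    pthread_create(&t2, NULL, half_sum, (void *)(long)(N / 2));
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %d (always %d)\n", s, N);
    return 0;
}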
Performance Criteria for Synch. Ops
• Latency (time per op)
• How long does it take if you always win
• Especially when light contention
• Bandwidth (ops per sec)
• Especially under high contention
• How long does it take (averaged over threads) when many others are
trying for it
• Traffic
• How many events on shared resources (bus, crossbar,…)
• Storage
• How much memory is required?
• Fairness
• Can any one thread be “starved” and never get the lock?
Barriers
• Software algorithms implemented using locks, flags,
counters
• Hardware barriers
• Wired-AND line separate from address/data bus
• Set input high when arrive, wait for output to be high to leave
• In practice, multiple wires to allow reuse
• Useful when barriers are global and very frequent
• Difficult to support arbitrary subset of processors
• even harder with multiple processes per processor
• Difficult to dynamically change number and identity of participants
• e.g., the latter due to process migration
• Not common today on bus-based machines
A Simple Centralized Barrier
• Shared counter maintains number of processes that have arrived
• increment when arrive (lock), check until reaches numprocs
• Problem?
struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to reach */
    mycount = ++bar_name.counter;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    }
    else while (bar_name.flag == 0) {};   /* busy wait for release */
}
A Working Centralized Barrier
• Consecutively entering the same barrier doesn’t work
• Must prevent process from entering until all have left previous instance
• Could use another counter, but increases latency and contention
• Sense reversal: wait for flag to take different value consecutive times
• Toggle this value only when all processes reach
BARRIER (bar_name, p) {
    local_sense = !(local_sense);      /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;      /* mycount is private */
    if (bar_name.counter == p) {       /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = local_sense;   /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};   /* busy wait for release */
    }
}
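A compilable rendering of this sense-reversing barrier (a sketch using pthreads and C11 atomics; the struct layout and names are assumptions, not the slides' code):

/* Sketch: centralized sense-reversing barrier with a mutex-protected counter */
/* and an atomic release flag that waiters spin on.                           */
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t lock;
    int counter;                       /* how many threads have arrived       */
    atomic_int flag;                   /* current release "sense"             */
} bar_type;

static bar_type bar = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };
static _Thread_local int local_sense = 0;

void barrier(bar_type *b, int p) {
    local_sense = !local_sense;        /* toggle private sense variable       */
    pthread_mutex_lock(&b->lock);
    b->counter++;
    if (b->counter == p) {             /* last to arrive                      */
        b->counter = 0;                /* reset for next barrier              */
        pthread_mutex_unlock(&b->lock);
        atomic_store(&b->flag, local_sense);   /* release waiters             */
    } else {
        pthread_mutex_unlock(&b->lock);
        while (atomic_load(&b->flag) != local_sense)
            ;                          /* busy wait for release               */
    }
}

Each thread keeps its own local_sense (thread-local here) and simply calls barrier(&bar, p) at every synchronization point; the flag toggles between 0 and 1 on consecutive barriers, which is exactly the sense reversal described above.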
Centralized Barrier Performance
• Latency
• Centralized has critical path length at least proportional to p
• Traffic
• About 3p bus transactions
• Storage Cost
• Very low: centralized counter and flag
• Fairness
• Same processor should not always be last to exit barrier
• No such bias in centralized
• Key problems for centralized barrier are latency and traffic
• Especially with distributed memory, traffic goes to same node
Improved Barrier Algorithm
Master-Slave barrier
• Master core gathers slaves on the barrier and releases them
• Use separate, per-core polling flags for different wait stages
[Figure: contention at the barrier, Centralized vs. Master-Slave]
• Separate gather and release trees
• Advantage: use of ordinary reads/writes instead of locks (array of flags)
• 2x(p-1) messages exchanged over the network
• Valuable in distributed network: communicate along different paths
Improved Barrier Algorithm
What if implemented on top of NUMA (cluster-based) shared memory system?
• e.g., p2012
[Figure: NUMA (cluster-based) shared-memory system: each cluster has PROC and MEM on a local XBAR; NIs connect the cluster XBARs through a global XBAR. The Master-Slave barrier is mapped onto this topology.]
• Not all messages have the same latency
• Need for a locality-aware implementation
Improved Barrier Algorithm
Software combining tree
• Only k processors access the same location, where k is the degree of the tree
[Figure: Centralized (contention) vs. Tree (little contention)]
• Separate arrival and exit trees, and use sense reversal
• Valuable in distributed network: communicate along different paths
• Higher latency (log p steps of work, and O(p) serialized bus xactions)
• Advantage: use of ordinary reads/writes instead of locks
Improved Barrier Algorithm
What if implemented on top of NUMA (cluster-based) shared memory system?
• e.g., p2012
[Figure: the same NUMA (cluster-based) shared-memory system as before, now with the Tree barrier mapped onto it.]
• Hierarchical synchronization
• locality-aware implementation
Barrier performance
[Figure: barrier performance comparison]
Parallel programming models
• Programming model is made up of the languages and
libraries that create an abstract view of the machine
• Control
• How is parallelism created?
• How are dependencies (orderings) enforced?
• Data
• Can data be shared or is it all private?
• How is shared data accessed or private data communicated?
• Synchronization
• What operations can be used to coordinate parallelism?
• What are the atomic (indivisible) operations?
Parallel programming models
• In this and the upcoming lectures we will see different
programming models and the features that each provides
with respect to
• Control
• Data
• Synchronization
• Pthreads
• OpenMP
• OpenCL