Transcript: Week 8 PowerPoint Slides
Multiprocessors and Multi-computers

• Multi-computers
  – Distributed address space accessible by local processors
  – Require message passing
  – Programming tends to be more difficult
• Multiprocessors
  – Single address space accessible by all processors
  – Simultaneous access to shared variables can produce inconsistent results
  – Programming is generally more convenient
  – Don't scale well beyond about sixteen processors
Shared Memory Hardware

[Figure: processors connected to memory modules in two arrangements: a bus configuration and a crossbar switch configuration]
Cache Coherence

• Cache Coherence Protocols
  – Write-Update: All caches are immediately updated with altered data
  – Write-Invalidate: Altered data is invalidated in all caches; updates take place only if the data is subsequently referenced. Significantly impacts performance
• False Sharing: Cache updates take place because multiple processes access the same cache block but not the same locations

[Figure: variables x and y in the same cache block, accessed by Processor 1 and Processor 2 through separate local caches backed by memory]

Note: Significant because each processor has a local cache.
Shared Memory Access

• Critical Section
  – A section of code that needs to be protected from simultaneous access
• Mutual Exclusion
  – The mechanism used to enforce a critical section
  – Mechanisms: locks, semaphores, monitors, condition variables

[Figure: Process 1 assigns 1 and Process 2 assigns 2 to the same shared variable x]
Sequential Consistency

Formally defined by Lamport (1979). A multiprocessor result is sequentially consistent if:
• The operations of each individual processor occur in the sequence specified by its program.
• The overall output matches some sequential ordering of the operations of all the processors.

Summary: Arbitrary interleaving of instructions does not affect the output generated. A classic litmus test appears below.
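A standard litmus test for sequential consistency (not from the slides; the register names r1 and r2 are illustrative):

      /* initially x = 0 and y = 0 */
      /* Processor 1 */        /* Processor 2 */
      x = 1;                   y = 1;
      r1 = y;                  r2 = x;

Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible: if r1 read 0, the read of y preceded y = 1, so x = 1 (which precedes the read of y on Processor 1) also preceded Processor 2's read of x, forcing r2 == 1. Relaxed memory models can produce r1 == r2 == 0.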
Deadlock

• Processes are permanently blocked waiting for needed resources
• Necessary Conditions
  – Circular Wait
  – Limited Resource
  – Non-preemptive
  – Hold and Wait

[Figure: circular wait among n processes, P1 … Pn, each holding resource Ri while requesting the next]
[Figure: Deadly Embrace (two-process deadlock): P1 holds R1 and requests R2, while P2 holds R2 and requests R1]
Locks

• Locks are the simplest mutual exclusion mechanism; normally they are provided by operating system calls
• Single-bit variable: 1 = locked, 0 = unlocked ("enter the door and lock the door at entry")
• Spin locks (busy-wait locks):

      while (lock == 1) ;  /* spin; normally involves hardware support */
      lock = 1;
      /* Critical section */
      lock = 0;

• Advantage: Simple and easy to understand
• Disadvantages
  – Poor use of the CPU if the process does not block while waiting
  – It's easy to skip the lock = 0 statement
• Examples: Pthreads and OpenMP provide OS abstractions
• Note: The while test and the setting of lock must be atomic (see the sketch below)
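The atomicity requirement is usually met with hardware test-and-set support. A minimal sketch using the C11 atomic_flag type (the function names acquire and release are illustrative, not part of any standard API):

      #include <stdatomic.h>

      atomic_flag lock = ATOMIC_FLAG_INIT;

      void acquire(void) {
          /* atomic_flag_test_and_set atomically sets the flag and returns
             its previous value, so no other processor can slip in between
             the test and the set */
          while (atomic_flag_test_and_set(&lock))
              ;  /* spin (busy wait) */
      }

      void release(void) {
          atomic_flag_clear(&lock);  /* the lock = 0 step */
      }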
Semaphores

• Limits concurrent access
• An integer variable, s, controls the mechanism
• Operations
  – P operation (Dutch "passeren", to pass):

        s--;
        while (s < 0) wait();

  – V operation (Dutch "vrijgeven", to release):

        s++;
        if (s <= 0) unblock a waiting process;

• Usage:

      p(s);
      /* Critical section */
      v(s);

• Notes
  – Set s = 1 initially for a binary semaphore, which acts like a lock
  – Set s = k > 1 initially if k simultaneous entries are possible
  – Set s = k <= 0 for consumer processes waiting to consume data produced
• Disadvantage: It's easy to skip the V operation (see the POSIX sketch below)
• Example: UNIX OS
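A minimal sketch of the same pattern with POSIX semaphores, where sem_wait and sem_post play the roles of P and V (the critical-section body is a placeholder):

      #include <semaphore.h>

      sem_t s;

      int main(void) {
          sem_init(&s, 0, 1);  /* initial value 1: binary semaphore */
          sem_wait(&s);        /* P: decrement; block if the count would go negative */
          /* Critical section */
          sem_post(&s);        /* V: increment; wake a waiting thread */
          sem_destroy(&s);
          return 0;
      }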
Monitors

• A class mechanism that limits access to a shared resource:

      public class DoIt {
          public DoIt() { /* constructor logic */ }
          public synchronized void critMethod() throws InterruptedException {
              wait();    // wait until another thread signals
              notify();
          }
      }

• Advantage: The most natural mutual exclusion mechanism
• Disadvantage: Requires a language that supports the construct
• Examples: Java, Ada, Modula-2
Condition Variables

• Mechanism to guarantee a global condition holds before critical section entry
• Advantages
  – Reduces the overhead of checking whether a global variable has reached some value
  – Avoids having to frequently "poll" the global variable
• Disadvantage: It's easy to skip the unlock operation
• Example: Pthreads (see the sketch below)
• Notes
  – wait() unlocks and re-locks the mutex automatically
  – Threads must already be waiting for a signal when it is thrown
• Example
  – Thread 1:

        lock(mutex);
        while (c != VALUE) wait(condVar, mutex);
        /* Critical section */
        unlock(mutex);

  – Thread 2:

        if (c == VALUE) signal(condVar);
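A minimal Pthreads version of the two threads above (the names mutex, condVar, c, and VALUE follow the slide; VALUE's definition is illustrative):

      #include <pthread.h>

      pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      pthread_cond_t condVar = PTHREAD_COND_INITIALIZER;
      int c = 0;
      #define VALUE 1

      void *thread1(void *arg) {                        /* waits until c reaches VALUE */
          pthread_mutex_lock(&mutex);
          while (c != VALUE)                            /* re-test after every wakeup */
              pthread_cond_wait(&condVar, &mutex);      /* unlocks, sleeps, re-locks */
          /* Critical section */
          pthread_mutex_unlock(&mutex);
          return NULL;
      }

      void *thread2(void *arg) {                        /* sets c and signals the waiter */
          pthread_mutex_lock(&mutex);
          c = VALUE;
          pthread_cond_signal(&condVar);                /* wake one waiting thread */
          pthread_mutex_unlock(&mutex);
          return NULL;
      }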
Shared Memory Programming Alternatives

• Heavyweight processes
• Modified syntax of an existing language (High Performance Fortran)
• Programming language designed for parallel processing (Ada)
• Compiler extensions to specify parallel execution (OpenMP)
• Thread programming standards: Java threads and Pthreads
Threads

• Definition: A path of execution through a process
• Heavyweight processes (UNIX fork, wait, waitpid, shmat, shmdt)
  – Disadvantage: expensive in time and memory
  – Advantage: A blocked process doesn't block the other processes
• Lightweight threads (pthreads library)
  – Each thread needs only its own stack space and instruction counter; everything else is shared
  – "Thread safe" programming is required to guarantee consistent results
• Pthreads (see the sketch below)
  – Threads can be spawned and started by other threads
  – They can run independently (detached from their parent thread) or require joins for termination
  – Formation of thread pools is possible
  – Threads communicate through signals
  – Processing order is indeterminate
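A minimal sketch of spawning, joining, and detaching with the pthreads library (the worker function is illustrative):

      #include <pthread.h>
      #include <stdio.h>

      void *worker(void *arg) {
          printf("thread %ld running\n", (long)arg);
          return NULL;
      }

      int main(void) {
          pthread_t t, d;
          pthread_create(&t, NULL, worker, (void *)1L);  /* spawn a thread */
          pthread_join(t, NULL);                         /* join: wait for termination */

          pthread_create(&d, NULL, worker, (void *)2L);
          pthread_detach(d);      /* detached: runs independently, no join */
          pthread_exit(NULL);     /* main exits, but the process lives until
                                     the detached thread finishes */
      }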
Forks and Joins

General flow of control:

      pid = fork();
      if (pid == 0) {
          /* Do spawned (child) code */
      } else {
          /* Do spawning (parent) code */
      }
      if (pid == 0) exit(0);
      else wait(0);

Note: Detached processes run independently of their parents, without joins. A self-contained version follows.
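A compilable version of this flow (the printed strings are illustrative):

      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void) {
          pid_t pid = fork();                 /* spawn a child process */
          if (pid == 0) {
              printf("child: spawned work\n");
              exit(0);                        /* child terminates */
          } else {
              printf("parent: spawning work\n");
              wait(NULL);                     /* join: wait for the child */
          }
          return 0;
      }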
Processes and Threads

[Figure: single-thread process vs. dual-thread process. Each thread has its own instruction pointer (IP) and stack; code, heap, listeners, and resources belong to the process]

Notes:
• Threads can be three orders of magnitude faster than processes
• Thread-safe library routines can be used by multiple concurrent threads
• Synchronization uses shared variables
Example Program (summing numbers)

Heavyweight UNIX processes (Section 8.7.1)

Pseudo code:

      Create semaphores
      Allocate shared memory and attach shared memory
      Load array with numbers
      Fork child processes
      IF parent THEN sum parent section
      ELSE sum child section
      P(semaphore)
      Add to global sum
      V(semaphore)
      IF child THEN terminate ELSE join
      Print results
      Release semaphores, detach and release shared memory

Note: The Java and Pthreads versions require about half the code. A condensed C sketch follows.
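A condensed sketch of this pseudocode for one parent and one child. It assumes a Linux-style system and uses mmap plus a process-shared POSIX semaphore in place of the System V shmat/shmdt and semaphore calls named above; the array size is illustrative:

      #include <semaphore.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define N 1000

      struct shared { int a[N]; long sum; sem_t sem; };

      int main(void) {
          /* Shared region visible to both processes after the fork */
          struct shared *sh = mmap(NULL, sizeof *sh, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          sem_init(&sh->sem, 1, 1);                      /* process-shared, binary */
          for (int i = 0; i < N; i++) sh->a[i] = i + 1;  /* load the array */
          sh->sum = 0;

          pid_t pid = fork();                            /* fork the child */
          int lo = (pid == 0) ? N / 2 : 0;               /* child: top half   */
          int hi = (pid == 0) ? N : N / 2;               /* parent: bottom half */
          long part = 0;
          for (int i = lo; i < hi; i++) part += sh->a[i];

          sem_wait(&sh->sem);                            /* P(semaphore) */
          sh->sum += part;                               /* add to global sum */
          sem_post(&sh->sem);                            /* V(semaphore) */

          if (pid == 0) exit(0);                         /* child terminates */
          wait(NULL);                                    /* parent joins */
          printf("sum = %ld\n", sh->sum);                /* expect N*(N+1)/2 */
          sem_destroy(&sh->sem);
          munmap(sh, sizeof *sh);
          return 0;
      }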
Modify Existing Language Syntax

Examples: High Performance Fortran and C

Example Constructs
• Declaration of a shared memory variable:

      shared int x;

• Specify statements to execute concurrently:

      par { s1(); s2(); s3(); ... sn(); }

• Iterations assigned to different processors:

      forall (i = 0; i < n; i++) { body(i); }

Compiler Optimizations
• The following works because the statements are independent:

      forall (i = 0; i < P; i++) a[i] = 0;

• Bernstein's conditions
  – Outputs from one processor cannot be inputs to another
  – Outputs from the processors cannot overlap
• Example: a = x + y; and b = x + z; are okay to execute simultaneously

Java Threads
• Instantiate and run a thread:

      ThreadClass t = new ThreadClass();
      t.start();

• Thread class:

      class ThreadClass extends Thread {
          public ThreadClass() { /* constructor */ }
          public void run() {
              while (true) {
                  // yield or sleep periodically.
                  // thread code executed here.
              }
          }
      }

Pthreads
• IEEE POSIX 1003.1c (1995): UNIX-based standardized C API
• Advantages
  – Industry-standardized interface that replaces vendor proprietary APIs
  – Thread creation, synchronization, and context switching are implemented in user space without kernel intervention, which is inherently more efficient than kernel-based thread operations
  – User-level implementation provides the flexibility to choose a scheduler that best suits the application, independent of the kernel scheduler
• Drawbacks
  – Poor locality limits performance when accessing shared data across processors
  – The Pthreads scheduler hasn't proven suited to managing large numbers of threads
  – Shared memory multithreaded programs typically follow the SPMD model
  – Most parallel programs are still coarse-grain in design

Performance Comparisons: Pthreads versus Kernel Threads
• Real: wall clock time (actual elapsed time)
• User: time spent in user mode
• Sys: time spent in the kernel within the process

Compiler Extensions (OpenMP)
• Extensions for C/C++, Fortran, and Java (JOMP)
• Consists of compiler directives, library routines, and environment variables
• Recognized industry standard developed in the late 1990s
• Designed for shared memory programming
• Uses a fork-join model, but with threads
• Parallel sections of code execute as "teams of threads"
• General syntax: C: #pragma omp    JOMP: //omp (see the sketch below)
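A minimal OpenMP sketch of the fork-join model in C (the loop and the reduction are illustrative):

      #include <omp.h>
      #include <stdio.h>

      int main(void) {
          long sum = 0;

          /* Fork a team of threads; iterations are divided among them,
             and the reduction combines the per-thread partial sums */
          #pragma omp parallel for reduction(+:sum)
          for (int i = 1; i <= 100; i++)
              sum += i;
          /* Implicit join: the team disbands at the end of the region */

          printf("sum = %ld, max threads = %d\n", sum, omp_get_max_threads());
          return 0;
      }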