Futures, Scheduling, and Work Distribution
Companion slides for
The Art of Multiprocessor Programming
by Maurice Herlihy & Nir Shavit
(Some images in this lecture courtesy of Charles
Leiserson)
How to write Parallel Apps?
• Split a program into parallel parts
• In an effective way
• Thread management
Matrix Multiplication
C = A · B
Matrix Multiplication
c_{ij} = \sum_{k=0}^{N-1} a_{ik} \cdot b_{kj}
Matrix Multiplication
class Worker extends Thread {      // each worker is a thread
  int row, col;                    // which matrix entry to compute
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {              // actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);  // create n x n threads
  for (int row …)
    for (int col …)
      worker[row][col].start();                 // start them
  for (int row …)
    for (int col …)
      worker[row][col].join();                  // wait for them to finish
}
What's wrong with this picture?
Thread Overhead
• Threads require resources
– Memory for stacks
– Setup, teardown
– Scheduler overhead
• Short-lived threads
– Ratio of work to overhead is bad
Thread Pools
• More sensible to keep a pool of long-lived threads
• Threads assigned short-lived tasks
– Run the task
– Rejoin pool
– Wait for next assignment
Thread Pool = Abstraction
• Insulate programmer from platform
– Big machine, big pool
– Small machine, small pool
• Portable code
– Works across platforms
– Worry about algorithm, not platform
ExecutorService Interface
• In java.util.concurrent
– Task = Runnable object
• If no result value expected
• Calls run() method.
– Task = Callable<T> object
• If result value of type T expected
• Calls T call() method.
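For concreteness, here is a minimal, self-contained sketch of both submission styles using the standard java.util.concurrent API; the pool size and task bodies are illustrative assumptions, not part of the slides:

import java.util.concurrent.*;

class ExecutorDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4);

    // Runnable: no result value expected; the pool calls run()
    Runnable hello = () -> System.out.println("hello from the pool");
    Future<?> done = executor.submit(hello);
    done.get();                        // blocks until the task completes

    // Callable<T>: a result of type T is expected; the pool calls call()
    Callable<Integer> answer = () -> 6 * 7;
    Future<Integer> future = executor.submit(answer);
    System.out.println(future.get()); // blocks until the value is available

    executor.shutdown();
  }
}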
Future<T>
Callable<T> task = …;
…
Future<T> future = executor.submit(task);
…
T value = future.get();
Submitting a Callable<T> task returns a Future<T> object.
The Future's get() method blocks until the value is available.
Future<?>
Runnable task = …;
…
Future<?> future = executor.submit(task);
…
future.get();
Submitting a Runnable task returns a Future<?> object.
The Future's get() method blocks until the computation is complete.
Note
• ExecutorService submissions
– Like New England traffic signs
– Are purely advisory in nature
• The executor
– Like the Boston driver
– Is free to ignore any such advice
– And could execute tasks sequentially …
Matrix Addition
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} =
\begin{pmatrix} A_{00}+B_{00} & A_{01}+B_{01} \\ A_{10}+B_{10} & A_{11}+B_{11} \end{pmatrix}
4 parallel additions
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // add this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case: add directly
    } else {
      // partition a, b into half-size matrices aij and bij
      // (a constant-time operation)
      Future<?> f00 = exec.submit(new AddTask(a00, b00));  // submit 4 tasks
      …
      Future<?> f11 = exec.submit(new AddTask(a11, b11));
      f00.get(); …; f11.get();  // let them finish
      …
    }}}
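The AddTask above is pseudocode: Matrix, the partitioning step, and a00 … b11 are left abstract. Below is a hypothetical concrete rendering over a plain double[][] (power-of-two dimension assumed), splitting on index ranges instead of a Matrix class; all names are my own, and the cached pool matters because parent tasks block on their children:

import java.util.concurrent.*;

class MatrixAddTask implements Runnable {
  static final ExecutorService exec = Executors.newCachedThreadPool();
  final double[][] a, b, c;
  final int row, col, dim;   // top-left corner and size of this block

  MatrixAddTask(double[][] a, double[][] b, double[][] c,
                int row, int col, int dim) {
    this.a = a; this.b = b; this.c = c;
    this.row = row; this.col = col; this.dim = dim;
  }

  public void run() {
    if (dim == 1) {
      c[row][col] = a[row][col] + b[row][col];     // base case: add directly
    } else {
      int h = dim / 2;                             // partition into quadrants
      try {
        Future<?> f00 = exec.submit(new MatrixAddTask(a, b, c, row,     col,     h));
        Future<?> f01 = exec.submit(new MatrixAddTask(a, b, c, row,     col + h, h));
        Future<?> f10 = exec.submit(new MatrixAddTask(a, b, c, row + h, col,     h));
        Future<?> f11 = exec.submit(new MatrixAddTask(a, b, c, row + h, col + h, h));
        f00.get(); f01.get(); f10.get(); f11.get(); // let them finish
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
  }
}

Like the slides' version, this spawns one task per block and is meant to illustrate structure, not performance.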
Dependencies
• Matrix example is not typical
• Tasks are independent
– Don't need results of one task to complete another
• Often tasks are not independent
Fibonacci
F(n) = \begin{cases} 1 & \text{if } n = 0 \text{ or } 1 \\ F(n-1) + F(n-2) & \text{otherwise} \end{cases}
• Note
– Potential parallelism
– Dependencies
Disclaimer
• This Fibonacci implementation is
– Egregiously inefficient
• So don't try this at home or on the job!
– But it illustrates our point
• How to deal with dependencies
• Exercise:
– Make this implementation efficient!
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
  static ExecutorService exec =
    Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) {
    arg = n;
  }
  public Integer call() throws Exception {
    if (arg > 2) {
      // parallel calls
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      // pick up & combine results
      return left.get() + right.get();
    } else {
      return 1;
    }}}
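A hypothetical driver for FibTask, reusing its own pool (the argument 10 is arbitrary; assumes the same package):

public static void main(String[] args) throws Exception {
  Future<Integer> result = FibTask.exec.submit(new FibTask(10));
  System.out.println(result.get());  // prints 55: the code computes F(10)
  FibTask.exec.shutdown();
}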
The Blumofe-Leiserson DAG Model
• Multithreaded program is
– A directed acyclic graph (DAG)
– That unfolds dynamically
• Each node is
– A single unit of work
Fibonacci DAG
(figure, built up over several slides: the DAG for fib(4) unfolds as fib(4) spawns fib(3) and fib(2), which spawn their own subtasks down to the base cases; call edges fan out and get edges join the results)
How Parallel is That?
• Define work:
– Total time on one processor
• Define critical-path length:
– Longest dependency path
– Can’t beat that!
Unfolded DAG
(figure: a fully unfolded computation DAG)
Parallelism?
Serial fraction = 3/18 = 1/6 …
Amdahl's Law says speedup cannot exceed 6.
Work?
T1: time needed on one processor
Just count the nodes …
(figure: the unfolded DAG with its 18 nodes numbered)
T1 = 18
Critical Path?
T∞: time needed on as many processors as you like
Longest path …
(figure: the longest dependency path in the DAG, 9 nodes long)
T∞ = 9
Notation Watch
• TP = time on P processors
• T1 = work (time on 1 processor)
• T∞ = critical path length (time on ∞ processors)
Simple Laws
• Work Law: TP ≥ T1/P
– In one step, can’t do more than P work
• Critical Path Law: TP ≥ T∞
– Can’t beat infinite resources
Performance Measures
• Speedup on P processors
– Ratio T1/TP
– How much faster with P processors
• Linear speedup
– T1/TP = Θ(P)
• Max speedup (average parallelism)
– T1/T∞
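Worked example, using the unfolded DAG above: T1 = 18 and T∞ = 9, so the maximum speedup (average parallelism) is T1/T∞ = 18/9 = 2, no matter how many processors we add.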
Sequential Composition
(figure: computation A followed by computation B)
Work: T1(A) + T1(B)
Critical Path: T∞(A) + T∞(B)
Parallel Composition
(figure: computations A and B side by side)
Work: T1(A) + T1(B)
Critical Path: max{T∞(A), T∞(B)}
Matrix Addition
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} =
\begin{pmatrix} A_{00}+B_{00} & A_{01}+B_{01} \\ A_{10}+B_{10} & A_{11}+B_{11} \end{pmatrix}
4 parallel additions
Addition
• Let AP(n) be the running time
– For an n x n matrix
– On P processors
• For example
– A1(n) is work
– A∞(n) is critical path length
Addition
• Work is
A1(n) = 4 A1(n/2) + Θ(1)
(4 spawned additions, plus Θ(1) to partition, synch, etc.)
Addition
• Work is
A1(n) = 4 A1(n/2) + Θ(1)
= Θ(n²)
Same as the double-loop summation
Addition
• Critical path length is
A∞(n) = A∞(n/2) + Θ(1)
(the 4 spawned additions run in parallel, plus Θ(1) to partition, synch, etc.)
Addition
• Critical path length is
A∞(n) = A∞(n/2) + Θ(1)
= Θ(log n)
Matrix Multiplication Redux
C = A · B
Matrix Multiplication Redux
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} =
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
First Phase …
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} =
\begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}
8 multiplications
Second Phase …
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} =
\begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}
4 additions
Multiplication
• Work is
M1(n) = 8 M1(n/2) + A1(n)
(8 parallel multiplications, plus the final addition)
Multiplication
• Work is
M1(n) = 8 M1(n/2) + Θ(n²)
= Θ(n³)
Same as the serial triple-nested loop
Multiplication
• Critical path length is
M∞(n) = M∞(n/2) + A∞(n)
(half-size multiplications in parallel, plus the final addition)
Multiplication
• Critical path length is
M∞(n) = M∞(n/2) + A∞(n)
= M∞(n/2) + Θ(log n)
= Θ(log² n)
Parallelism
• M1(n)/M∞(n) = Θ(n³/log² n)
• To multiply two 1000 x 1000 matrices
– 1000³/10² = 10⁷ (taking log 1000 ≈ 10, so log² 1000 ≈ 10²)
• Much more than the number of processors on any real machine
Shared-Memory Multiprocessors
• Parallel applications
– Do not have direct access to HW processors
• Mix of other jobs
– All run together
– Come & go dynamically
Ideal Scheduling Hierarchy
Tasks
User-level scheduler
Processors
Realistic Scheduling Hierarchy
Tasks
User-level scheduler
Threads
Kernel-level scheduler
Processors
For Example
• Initially,
– All P processors available for application
• Serial computation
– Takes over one processor
– Leaving P-1 for us
– Waits for I/O
– We get that processor back ….
Speedup
• Map threads onto P processes
• Cannot get P-fold speedup
– What if the kernel doesn’t cooperate?
• Can try for speedup proportional to P
Scheduling Hierarchy
• User-level scheduler
– Tells kernel which threads are ready
• Kernel-level scheduler
– Synchronous (for analysis, not correctness!)
– Picks pi threads to schedule at step i
Greedy Scheduling
• A node is ready if its predecessors are done
• Greedy: schedule as many ready nodes as possible
• Optimal scheduler is greedy (why?)
• But not every greedy scheduler is optimal
Greedy Scheduling
There are P processors.
Complete Step:
• ≥ P nodes ready
• run any P of them
Incomplete Step:
• < P nodes ready
• run them all
Theorem
For any greedy scheduler,
TP ≤ T1/P + T∞
• TP is the actual time
• T1/P: no better than the work divided among the processors
• T∞: no better than the critical path length
TP ≤ T1/P + T∞
Proof:
Number of complete steps ≤ T1/P …
… because each performs P work.
Number of incomplete steps ≤ T∞ …
… because each shortens the unexecuted critical path by 1.
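Worked example, again with T1 = 18 and T∞ = 9: on P = 2 processors any greedy schedule needs at most 18/2 + 9 = 18 steps, while the work and critical path laws guarantee TP ≥ max{18/2, 9} = 9.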
Near-Optimality
Theorem: any greedy scheduler is within a
factor of 2 of optimal.
Remark: Optimality is NP-hard!
Proof of Near-Optimality
Let TP* be the optimal time.
From the work and critical path laws:
TP* ≥ max{T1/P, T∞}
From the theorem:
TP ≤ T1/P + T∞
≤ 2 max{T1/P, T∞}
≤ 2 TP*
Work Distribution
zzz…
Work Dealing
Yes!
The Problem with Work Dealing
D'oh!
D'oh!
D'oh!
Work Stealing
No work…
Yes!
Lock-Free Work Stealing
• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else's
• Choose victim at random (a sketch of this loop follows)
Local Work Pools
Each work pool is a Double-Ended Queue
Work DEQueue¹
(figure: a deque of tasks; pushBottom and popBottom operate at the bottom)
¹ Double-Ended Queue
Obtain Work
• Obtain work by calling popBottom
• Run the task until it blocks or terminates
New Work
• Unblocking or spawning a node creates new work
• Add it with pushBottom
Whatcha Gonna Do When the Well Runs Dry?
@&%$!! (empty)
Steal Work from Others
Pick a random thread's DEQueue
Steal this Task!
popTop
Task DEQueue
• Methods
– pushBottom
– popBottom
– popTop
• pushBottom and popBottom are called only by the owner, so they never happen concurrently with each other
• They are also the most common calls – make them fast (minimize use of CAS)
Ideal
• Wait-Free
• Linearizable
• Constant time
Fortune Cookie: "It is better to be young, rich and beautiful, than old, poor, and ugly"
Compromise
• Method popTop may fail if
– a concurrent popTop succeeds, or
– a concurrent popBottom takes the last task
Blame the victim!
Dreaded ABA Problem
(figure sequence: a thief reads top and prepares a CAS; while it is delayed, other threads pop and push tasks until top holds the same index as before; the thief's CAS then succeeds ("Yes!") on a stale snapshot of the queue. Uh-Oh …)
Fix for Dreaded ABA
(figure: the top index is tagged with a stamp that is incremented on every update, so a stale CAS fails even if the index value repeats)
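The standard library provides exactly this pairing. A minimal sketch of the (index, stamp) operations the deque below relies on (the initial values are illustrative):

import java.util.concurrent.atomic.AtomicStampedReference;

// An (index, stamp) pair updated atomically; initial index 0, stamp 0:
AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);

int[] stampHolder = new int[1];
Integer oldTop = top.get(stampHolder);   // reads index & stamp together
int oldStamp = stampHolder[0];

// Succeeds only if BOTH the index and the stamp are unchanged:
boolean stole = top.compareAndSet(oldTop, oldTop + 1, oldStamp, oldStamp + 1);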
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top;  // index & stamp (updated atomically)
  volatile int bottom;  // index of the bottom task: no need to synchronize,
                        // but a memory barrier is needed (hence volatile)
  Runnable[] tasks;     // array holding the tasks
  …
}
pushBottom()
public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;  // bottom is the index at which to store the new task
    bottom++;           // adjust the bottom index
  }
  …
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;  // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;  // compute new value & stamp
  if (bottom <= oldTop)  // quit if the queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))  // try to steal
    return r;
  return null;  // give up if a conflict occurs
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;  // make sure the queue is non-empty
  bottom--;                      // prepare to grab the bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;  // read top & prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)  // top & bottom one or more apart: no conflict
    return r;
  if (bottom == oldTop) {  // at most one item left
    bottom = 0;  // reset bottom: the deque will be empty even if the CAS
                 // fails, because a thief must then have taken the task
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r;  // I win the CAS: the last task is mine
  }
  // If I lose the CAS, the thief must have won. Failed to get the last
  // task (bottom could even be less than top), but top and bottom must
  // still be reset since the deque is empty.
  top.set(newTop, newStamp);
  bottom = 0;
  return null;
}
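A hypothetical single-threaded smoke test of the deque above, assuming a constructor BDEQueue(int capacity) that allocates the tasks array and initializes top to index 0 with stamp 0 and bottom to 0:

BDEQueue q = new BDEQueue(16);
q.pushBottom(() -> System.out.println("task A"));
q.pushBottom(() -> System.out.println("task B"));
Runnable stolen = q.popTop();    // a thief would take task A from the top
Runnable mine = q.popBottom();   // the owner takes task B from the bottom
stolen.run();
mine.run();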
Old English Proverb
• “May as well be hanged for stealing a
sheep as a goat”
• From which we conclude:
– Stealing was punished severely
– Sheep were worth more than goats
Variations
• Stealing is expensive
– Pay for a CAS
– Only one task taken
• What if we
– Move more than one task
– Randomly balance loads?
Work Balancing
(figure: queues of sizes 5 and 2 are rebalanced so that one holds ⌈(5+2)/2⌉ = 4 tasks and the other ⌊(5+2)/2⌋ = 3 tasks)
Work-Balancing Thread
public void run() {
  int me = ThreadID.get();
  while (true) {  // keep running tasks
    Runnable task = queue[me].deq();
    if (task != null) task.run();
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) {  // with probability 1/(size+1)
      int victim = random.nextInt(queue.length);  // choose a random victim
      int min = …, max = …;
      synchronized (queue[min]) {  // lock queues in canonical order
        synchronized (queue[max]) {
          balance(queue[min], queue[max]);  // rebalance the queues
        }}}}}
(a sketch of balance() follows)
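The slides do not show balance() itself. Here is a minimal sketch under the assumption that the queues expose size(), deq(), and enq() and that WorkQueue is their type (these names are mine), moving tasks until the sizes match the ⌈·⌉/⌊·⌋ split pictured earlier:

// Hypothetical balance(): move tasks from the larger queue to the smaller
// until their sizes differ by at most one, i.e. the larger ends with
// ceil((a+b)/2) tasks and the smaller with floor((a+b)/2).
static void balance(WorkQueue q0, WorkQueue q1) {
  WorkQueue small = (q0.size() < q1.size()) ? q0 : q1;
  WorkQueue large = (small == q0) ? q1 : q0;
  while (large.size() - small.size() > 1)
    small.enq(large.deq());
}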
Work Stealing & Balancing
• Clean separation between app &
scheduling layer
• Works well when number of processors
fluctuates.
• Works on “black-box” operating systems
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
You are free:
– to Share — to copy, distribute and transmit the work
– to Remix — to adapt the work
Under the following conditions:
– Attribution. You must attribute the work to "The Art of Multiprocessor Programming" (but not in any way that suggests that the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to
– http://creativecommons.org/licenses/by-sa/3.0/.
Any of the above conditions can be waived if you get permission from the copyright holder.
Nothing in this license impairs or restricts the author's moral rights.
TOM MARVOLO RIDDLE