Haskell on a Shared-Memory Multiprocessor

Tim Harris
Simon Marlow
Simon Peyton Jones
Why now?
• Shift in the balance:
– no more free sequential performance boosts
– SMP hardware will be the norm
– non-parallel programs will be frozen in performance
– even a modest parallel speedup is now worthwhile, because the other processors come for free
• race to produce good parallel languages
The story so far…
• Parallel FP research is not new, but
– it has mostly focussed on distributed memory, and hence separate heaps:
• communication is expensive, so careful tuning of work distribution is needed
– multi-core processors (for small N) will be shared memory, so we can use a single heap:
• almost zero communication overhead means better prospects for reliable speedup
• tradeoffs are likely to be quite different
• less scalability beyond small N
Concurrent Haskell
• Concurrent programming in Haskell is exciting right now:
– STM means less error-prone concurrent programming
– we understand how Concurrent Haskell interacts with OS-level concurrency and the FFI
– lots of people are using it
• Concurrent programs are parallel programs too
– so we already have plenty of parallel programs to play with
– to say it another way: we can use Concurrent Haskell to write parallel programs (no need for parallel annotations like par straight away); see the sketch below
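To make that last point concrete, here is a minimal Concurrent Haskell sketch (not from the talk): two forkIO threads do independent work and hand their results back through MVars. On a single processor the threads are interleaved; on the SMP runtime the very same code can run them on separate processors.

  import Control.Concurrent (forkIO)
  import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

  main :: IO ()
  main = do
    done1 <- newEmptyMVar
    done2 <- newEmptyMVar
    -- two independent computations, each in its own Haskell thread;
    -- ($!) forces each result in the worker, not in main
    _ <- forkIO (putMVar done1 $! sum [1 .. 1000000 :: Integer])
    _ <- forkIO (putMVar done2 $! product [1 .. 20 :: Integer])
    s <- takeMVar done1
    p <- takeMVar done2
    print (s, p)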
So, what’s the problem?
• Suppose we let 2 Haskell threads loose on a shared heap. What goes wrong?
– allocation: the threads had better have separate allocation areas
– immutable heap objects present no problems (and are common!)
– mutable objects: MVars, TVars. We had better make sure that these are thread-safe.
– shared data in the runtime: e.g. the scheduler’s run queue, the garbage collector’s remembered set. Access to these must be made thread-safe.
– but…
The real problem is Thunks!
[Diagram: allocation builds a THUNK for "fac z" (capturing the free variable z) for the binding in "let x = fac z in x * 2"; evaluation enters the thunk, pushes an update frame on the stack, and the returned value is written back by the update, overwriting the thunk with an IND to the value.]
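For reference, the program from the diagram made self-contained; fac is assumed here to be an ordinary factorial, standing in for whatever function the slide intends.

  -- evaluating "example z" allocates a thunk for (fac z); the first
  -- evaluation pushes an update frame and overwrites the thunk with an
  -- indirection (IND) to the result
  fac :: Integer -> Integer
  fac 0 = 1
  fac n = n * fac (n - 1)

  example :: Integer -> Integer
  example z = let x = fac z in x * 2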
Should we lock thunks?
• Thunks are clearly shared mutable state, so we should protect against simultaneous access with a mutex, right?
[Diagram: a THUNK closure with its free variables.]
Locks are v. expensive
• A lock is implemented using a guaranteed-atomic instruction, such as compare-and-swap (CAS).
• These instructions are about 100x more expensive than ordinary instructions.
• We measured adding two CAS instructions to every thunk evaluation; the result was about 50% worse performance.
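As a rough picture of what locking every thunk would mean, here is a sketch in Haskell rather than the runtime's C, with an MVar standing in for the per-thunk mutex. The names and types (Cell, LockedThunk, force) are invented for the sketch; the point is only that every evaluation pays for two atomic operations, the take and the put.

  import Control.Concurrent.MVar (MVar, newMVar, takeMVar, putMVar)

  data Cell a = Unevaluated (IO a) | Evaluated a

  newtype LockedThunk a = LockedThunk (MVar (Cell a))

  newLockedThunk :: IO a -> IO (LockedThunk a)
  newLockedThunk io = LockedThunk <$> newMVar (Unevaluated io)

  force :: LockedThunk a -> IO a
  force (LockedThunk m) = do
    cell <- takeMVar m               -- first atomic operation: acquire
    case cell of
      Evaluated v    -> do putMVar m cell; return v
      Unevaluated io -> do
        v <- io                      -- other threads block on the MVar here
        putMVar m (Evaluated v)      -- second atomic operation: update and release
        return v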
Can we do it lock-free?
[Diagram: a THUNK with its free variables, with two threads about to enter it.]
• What would go wrong if we let them both evaluate it?
– they both compute the same value…
– just extra work
– most thunks are cheap
Not quite that simple…
• Race between update and entry:
[Diagram: one thread overwrites the THUNK with an IND to the value while another thread is still entering it.]
Hardware re-ordering?
• Not all processors guarantee strong memory ordering
– no read ordering: a processor might observe the writes in a different order
– no write ordering: the header might be written before the value, or worse, the value itself might be written after the update
– happily, x86 currently guarantees both read and write ordering
Hardware re-ordering cont.
• No write ordering => we need a memory barrier (could be expensive!)
• write ordering but no read ordering:
[Diagram: the THUNK layout gains a padding field, initialised to 0, ahead of the free variables.]
Can we reduce duplication?
[Diagram: two threads, each with an update frame pointing at the same THUNK.]
• idea:
– periodically scan each thread’s stack
– attempt to claim exclusive access to each thunk under evaluation
– halt any duplicate evaluation
Claiming a thunk
• traverse a thread’s stack
• when we reach an update frame, atomically swap the header word of the thunk with BLACKHOLE
[Diagram: the update frame points at the thunk; its THUNK header word is swapped for BLACKHOLE, leaving the padding field and free variables in place.]
Claiming a thunk
• If the header was previously:
1. a THUNK, we have now claimed it
2. BLACKHOLE, another thread owns it
3. IND, another thread has already updated it
[Diagram: an update frame pointing at a BLACKHOLE owned by another thread; the evaluation above that frame is duplicate work.]
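A conceptual sketch of the claim step, again in Haskell rather than the runtime's C: the header word is modelled as an IORef and atomically swapped to BLACKHOLE, and the three cases above fall out of what was there before. The Header and Claim types are invented for the sketch and are not GHC's actual closure layout.

  import Data.IORef (IORef, atomicModifyIORef')

  data Header a
    = THUNK (IO a)   -- unevaluated and unclaimed
    | BLACKHOLE      -- claimed by some thread
    | IND a          -- already updated with its value

  data Claim a = Claimed (IO a) | OwnedElsewhere | AlreadyUpdated a

  claim :: IORef (Header a) -> IO (Claim a)
  claim header =
    atomicModifyIORef' header $ \h ->
      case h of
        THUNK io  -> (BLACKHOLE, Claimed io)       -- case 1: we now own it
        BLACKHOLE -> (BLACKHOLE, OwnedElsewhere)   -- case 2: duplicate evaluation
        IND v     -> (IND v, AlreadyUpdated v)     -- case 3: nothing left to do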
What happens to the duplicate evaluation?
[Diagram: the duplicated portion of the stack is suspended into an AP_STACK closure, and the thunk this thread had claimed is updated with an IND to it, so the work done so far is kept.]
• Well-known technique (Reid ’99), also used in asynchronous exceptions and STM.
Stopping duplicate evaluation, cont.
[Diagram: an update frame pointing at a BLACKHOLE that another thread has claimed; this thread must block.]
• The thread blocks until the BLACKHOLE has completed evaluation.
Claiming thunks
• Behaves like real locking for long-running thunks, and like lock-free execution for short-lived thunks: precisely what we want.
• Must mark update frames for thunks we have claimed, so we don’t attempt to claim twice.
• If a thread has claimed a thunk, this does not necessarily mean that it is the only thread evaluating it: the other thread(s) may not have tried to claim it yet.
Evaluating a BLACKHOLE, blocking
• What if a thread enters a BLACKHOLE, i.e. a claimed thunk?
• The thread must block.
• In single-threaded GHC, we attached blocked threads to the BLACKHOLE itself.
– easy to find the blocked threads when updating the BLACKHOLE, but
– in a multi-threaded setting this leads to more race conditions on the thunk
– so we must store the queue of blocked threads in a separate list, and check it periodically (see the sketch below)
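One way to picture that separate list, as a hedged Haskell model rather than the real scheduler code: each blocked thread registers a wake-up MVar against the BLACKHOLE it is waiting on, and a periodic scan wakes any thread whose thunk has since been updated to an IND. The Header type and the queue shape are invented for the sketch.

  import Control.Concurrent.MVar (MVar, putMVar)
  import Data.IORef (IORef, readIORef, atomicModifyIORef')

  data Header a = THUNK (IO a) | BLACKHOLE | IND a

  -- pairs each blocked thread's wake-up MVar with the thunk it waits on
  type BlockedQueue a = IORef [(IORef (Header a), MVar a)]

  -- called periodically by the scheduler; it only reads the thunk header,
  -- so it adds no new races on the thunk itself
  wakeUpdated :: BlockedQueue a -> IO ()
  wakeUpdated queue = do
    blocked <- atomicModifyIORef' queue (\q -> ([], q))
    stillBlocked <- concat <$> mapM tryWake blocked
    atomicModifyIORef' queue (\q -> (stillBlocked ++ q, ()))
    where
      tryWake entry@(thunk, wakeup) = do
        h <- readIORef thunk
        case h of
          IND v -> do putMVar wakeup v; return []   -- updated: wake the thread
          _     -> return [entry]                   -- still owned: keep waiting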
Black-holing
• Black-holing has been around for a while. It also:
– fixes some space leaks
– catches some loops
• We are just extending the existing black-holing technique to catch duplicate work in SMP-GHC.
Narrowing the window: grey-holing
• ToDo
More possibilities for duplication
z = let x = … expensive …
    in Just x
• two threads evaluate z simultaneously, creating two copies of x
• x is duplicated for ever
• we can try to catch this at the update: if we update an IND, then return the other value. Not foolproof.
STM(?)
• ToDo
Measurements
• using real locks
Measurements
• our lock-free implementation
Case study: parallelising GHC --make
• GHC --make compiles multiple modules in dependency order
• .hi files for library modules are read once and shared by future compilations
• we want to parallelise compilations of independent modules, while synchronising access to the shared state
parallel compilation
[Diagram: Main depends on modules A, B and C, which can be compiled in parallel.]
GHC’s shared state
[Diagram: the compilations of Main, A, B and C all touch GHC’s shared state.]
• It’s a dataflow graph!
• one thread for each node; it blocks until results are available from all the inputs
• parallel compilation happens automatically
• simple throttling to prevent too many simultaneous compilations (see the sketch below)
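A hedged sketch of the dataflow scheme, not GHC's actual driver code (compile, buildGraph, Result and the module-list shape are all invented for illustration): one thread per module blocks on its dependencies' MVars, and a QSem throttles how many compilations run at once.

  import Control.Concurrent (forkIO)
  import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)
  import Control.Concurrent.QSem (newQSem, waitQSem, signalQSem)

  type ModuleName = String
  data Result = Result   -- stands in for a compiled module's interface

  -- hypothetical compile step; the real one is GHC compiling one module
  compile :: ModuleName -> [Result] -> IO Result
  compile _name _deps = return Result

  buildGraph :: Int -> [(ModuleName, [ModuleName])] -> IO [(ModuleName, MVar Result)]
  buildGraph jobs modules = do
    sem   <- newQSem jobs
    slots <- mapM (\(name, _) -> (,) name <$> newEmptyMVar) modules
    let slotFor m = maybe (error ("unknown module " ++ m)) id (lookup m slots)
    mapM_ (\(name, deps) -> forkIO $ do
              inputs <- mapM (readMVar . slotFor) deps   -- block on dependencies
              waitQSem sem                               -- throttle concurrency
              result <- compile name inputs
              signalQSem sem
              putMVar (slotFor name) result)             -- publish our result
          modules
    return slots

The caller would then wait on the slot for the root module (e.g. Main) to know when the whole build has finished.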
Results: ideal 2 identical modules
• Why not a speedup of 2?
– GC is single-threaded, and there is more GC when compiling in parallel (more live data)
– dependency analysis is single-threaded
– interface loading is shared
– increased load on the memory system
• discounting GC, we get a speedup of 1.54
• a speedup of 1.3 used 1.5 CPUs
Results: compiling Happy
• Modules are not completely independent; the speedup drops to 1.2
Results: compiling Anna
• larger program
• make -j2 is now losing
• better parallel speedup when optimising:
– probably a lower proportion of time spent reading interface files,
– and proportionally lower contention for shared state
Conclusion & what’s next?
• lock-free thunk evaluation looks promising
• current issues:
– lock contention in the runtime
– lack of processor affinity
– the combination leads to dramatic slowdowns for some examples, particularly concurrent programs
• we are redesigning the scheduler to fix these issues
• multithreaded GC
– tricky, but well-understood
– benefits everyone on multi-core/multi-processor machines
• Apps!
• full support for SMP is planned for GHC 6.6