A Block-structured Heap Simplifies Parallel GC


Parallel garbage collection with a block-structured heap
Simon Marlow (Microsoft Research)
Simon Peyton Jones (Microsoft Research)
Roshan James (U. Indiana)
Tim Harris (Microsoft Research)
What’s parallel GC?
[Diagram: timeline of a parallel Haskell program; several mutator threads run in parallel, but all of them stop for each single-threaded GC.]
The GC is single-threaded, so Amdahl’s Law limits the amount of speedup we can expect. GC is the bottleneck for parallel execution.
[Diagram: timeline of a single-threaded Haskell program (e.g. GHC itself), alternating between mutator and GC.]
Setting the scene…
• NB: parallel, not concurrent
– “Parallel” means we stop all the mutator threads and perform GC using multiple GC threads.
– “Concurrent” means running the GC at the same time as the mutator.
– Concurrent GC is more difficult, and can impose some overhead on the mutator, since the mutator must synchronise with the GC.
• Target: commodity multi-cores
– For now we aim at <=8 cores, which may mean we sometimes choose low overhead over scalability. Later we may revisit these decisions.
High-level structure
• Key design decision: our storage manager is divided into two layers.
• The block allocator
– requests memory from the OS
– provides blocks of memory to the rest of the RTS
– manages a pool of free blocks.
• The GC allocates memory from the block layer only.
[Diagram: the GC sits on top of the Block Allocator, which obtains memory from the OS via malloc() / mmap().]
Blocks of memory
• Memory is managed in units of the fixed block size (currently 4k).
• Blocks can be allocated singly, or in contiguous groups to accommodate larger objects.
• Blocks can be linked together into lists to form areas.
Why divide memory into blocks?
• Flexibility, portability
1. The storage manager needs to track multiple regions (e.g. generations) which may grow and shrink over time.
– Contiguous memory would be problematic: how much space do we allocate to each one, so that they don’t grow into each other?
– Address space is limited.
– With linked lists of blocks, we can grow or shrink a region whenever we like by adding/removing blocks.
2. Managing large objects is easier: each gets its own block group, and we can link large objects onto lists, so there is no need to copy an object to move it.
– Some wastage due to slop (<1%)
More advantages of blocks
3. Portable: all we require from the OS is a way to allocate memory; there are no dependencies on address-space layout.
4. Memory can be recycled quickly: cache-friendly.
5. Sometimes we need to allocate when it is inconvenient to GC: we can always grab another block.
6. A “useful granularity”
– useful for performing occasional checks during execution (e.g. context switching)
– for dividing up the work in a parallel GC…
How do blocks work?
• Each block has a fixed table of data: the “block descriptor”.
– start: the start of the block
– free: the first free byte in the block
– link: chains blocks together (or links to the head of the group)
– blocks: the number of blocks in the group (0 if not the head)

struct bdescr {
  void *start, *free, *link;
  int blocks;
  …
};

bdescr *Bdescr (void *p);         /* maps an address to its block descriptor
                                     in a few instructions, no memory accesses */
bdescr *allocBlocks (int blocks);
void    freeBlocks (bdescr *);
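As a rough usage sketch (assuming the declarations above; the area-building pattern and variable names are illustrative, not the RTS’s actual code):

void example_area (void)
{
    /* Build an “area” as a linked list of blocks via the link field. */
    bdescr *area = NULL;

    bdescr *bd = allocBlocks(1);      /* a single 4k block */
    bd->link = area;                  /* chain it onto the area */
    area = bd;

    bdescr *big = allocBlocks(4);     /* a contiguous 16k group,
                                         e.g. for a large object */
    big->link = area;
    area = big;

    /* Dismantle the area block by block: */
    while (area != NULL) {
        bdescr *next = area->link;
        freeBlocks(area);
        area = next;
    }
}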
Where do block descriptors live?
• Choice 1: at the start of a block.
– Bdescr(p) is one instruction: p & 0xfffff000
– Bad for cache & TLB: we often traverse block descriptors, and if they are scattered all over memory this thrashes the TLB.
– Contiguous multi-block groups can only have a descriptor for the first block.
– A block contains “4k minus a bit” of space for data (awkward).
Choice 2
• Block descriptors are grouped together, and can be found by an address calculation.
– Bdescr(p) is ~6 instructions (next slide…)
– We can keep block descriptors together: better for cache/TLB.
– Contiguous block groups are easier: all blocks in the group have descriptors.
– Blocks contain a full 4k of data.
Megablocks
[Diagram: a megablock of 2^M bytes, aligned to 2^M, divided into blocks 1…N of 2^K bytes each.]
The block allocator requests memory from the operating system in units of a Megablock, which is divided into N blocks and block descriptors.
Megablocks
[Diagram: the same 2^M-byte aligned megablock, with the block descriptors (2^D bytes each) for blocks 1…N at the start, followed by an empty gap and then the blocks themselves.]

Bdescr(p) = (((p & (2^M - 1)) >> K) << D) | (p & ~(2^M - 1))
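For concreteness, here is that calculation as C; a minimal sketch with illustrative values of M, K and D (not necessarily GHC’s actual constants):

#include <stdint.h>

#define MBLOCK_SHIFT 20   /* M: 2^20 = 1MB megablocks (illustrative)  */
#define BLOCK_SHIFT  12   /* K: 2^12 = 4k blocks                      */
#define BDESCR_SHIFT  6   /* D: 2^6 = 64-byte block descriptors       */
#define MBLOCK_MASK  (((uintptr_t)1 << MBLOCK_SHIFT) - 1)

typedef struct bdescr bdescr;

/* A few arithmetic instructions, no memory accesses. */
static inline bdescr *Bdescr (void *p)
{
    uintptr_t a       = (uintptr_t)p;
    uintptr_t mblock  = a & ~MBLOCK_MASK;                   /* megablock base        */
    uintptr_t blockno = (a & MBLOCK_MASK) >> BLOCK_SHIFT;   /* block index within it */
    return (bdescr *)(mblock | (blockno << BDESCR_SHIFT));
}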
Parallel GC
• First we consider how to parallelise 2-space copying GC, and then extend this to generational GC.
Background: Cheney copying collection
[Diagram: Cheney copying collection. Roots point to live objects in the allocation area; live objects are copied to to-space; copied objects are scanned for more roots; collection is complete when the scan pointer catches up with the allocation pointer.]
How can we parallelise copying GC?
• Basically, we want each CPU to copy and scan different objects.
• The main problem is finding an effective way to partition the problem, so we can keep all CPUs busy all the time (load balancing).
• Static partitioning (e.g. partition the heap by address) is not good:
– live data might not be evenly distributed, leading to poor load balancing
– we need synchronisation when pointers cross partition boundaries
Load balancing
• So we need dynamic load-balancing for GC.
– The pending set is the set of objects copied but not yet scanned.
– Each CPU runs:

    while (pending set non-empty) {
        remove an object p from the pending set
        scan(p)
    }

– where

    scan(p) {
        for each object q pointed to by p {
            if (q has not been copied) {
                copy q
                add q to the pending set
            }
        }
    }

– (We need synchronisation to prevent two threads copying the same object – later.)
The Pending Set
– Now the problem is reduced to finding a good representation for the pending set.
– Operations on the pending set are in the inner loop, so heavyweight synchronisation would be very expensive.
– In standard copying GC the pending set is represented implicitly (and cheaply!) by to-space. Hence any explicit representation will incur an overhead compared to single-threaded GC, and eat into any parallel speedup.
Previous solutions
• Per-CPU work-stealing queues (Flood et al. (2001)).
– good load balancing, but
– some administrative overhead (quantity unknown)
– needs clever lock-free data structures
– needs some strategy for overflow (GC can’t use arbitrary extra memory!)
• Dividing the pending set into chunks (Imai/Tick (1993), others).
– coarser granularity reduces synchronisation overhead
– less effective load-balancing, especially if the chunk size is too large.
How blocks help
• Since to-space is already a list of blocks, it is a natural representation for a chunked pending set!
– No need for a separate pending-set representation, and no extra administrative overhead relative to single-threaded GC.
– Larger blocks => lower synchronisation overhead
– Smaller blocks => better load balancing
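To make this concrete, here is a minimal sketch of a chunked pending set built from to-space blocks, reusing the bdescr link field from earlier. A global lock is used purely for simplicity of illustration; the helper names are hypothetical, not the RTS’s actual API:

#include <pthread.h>

static bdescr *pending_set = NULL;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* A to-space block that has filled up becomes a unit of GC work. */
void push_pending (bdescr *bd)
{
    pthread_mutex_lock(&pending_lock);
    bd->link = pending_set;           /* reuse the block's link field */
    pending_set = bd;
    pthread_mutex_unlock(&pending_lock);
}

/* Returns NULL when there is currently no work. */
bdescr *pop_pending (void)
{
    pthread_mutex_lock(&pending_lock);
    bdescr *bd = pending_set;
    if (bd != NULL) pending_set = bd->link;
    pthread_mutex_unlock(&pending_lock);
    return bd;
}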
But what if…
• … the pending set isn’t large enough to fill a block? E.g. if the heap consists of a single linked list of integers, the scan pointer will always be close to the allocation pointer, so we will never generate a full block of work.
• There may be little available parallelism in the heap structure anyway.
• But even with e.g. 2 linked lists we would still not be able to parallelise on two cores, because the scan pointer will only be 2 words behind the allocation pointer.
Available parallelism
• There should be enough parallelism, at least in old-gen collections.
GC data structures
[Diagram: the pending set (blocks not yet scanned), a “done” list of scanned blocks, and a private workspace for each GC thread. Legend: free memory / not scanned / scanned.]
Inside a workspace…
[Diagram: a workspace holds a Scan block (with a scan pointer) and an Alloc block (with an allocation pointer). Legend: free memory / not scanned / scanned.]
• Objects are copied into the Alloc block (thread-private allocation!)
• Loop:
– Grab a block to be scanned from the pending set
– Scan it
– Push it back to the “done” list
– When an Alloc block becomes full, move it to the pending set and grab an empty block
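Putting the pieces together, a GC thread’s outer loop might look roughly like this; a sketch only, using the hypothetical pop_pending/push_pending helpers above plus equally hypothetical scan_block and push_done routines:

/* Per-thread workspace: the current Alloc block plus bookkeeping. */
typedef struct workspace {
    bdescr *alloc_block;    /* where this thread copies objects to */
    /* ... scan block, scan pointer, partial blocks, etc. ... */
} workspace;

void scan_block (workspace *ws, bdescr *bd);  /* evacuate bd's pointees (hypothetical) */
void push_done  (bdescr *bd);                 /* move bd to the done list (hypothetical) */

void gc_thread_main (workspace *ws)
{
    bdescr *bd;
    while ((bd = pop_pending()) != NULL) {
        /* Copies land in ws->alloc_block; full Alloc blocks are
           themselves pushed onto the pending set as new work. */
        scan_block(ws, bd);
        push_done(bd);
    }
    /* An empty pending set does not mean GC has finished –
       see the termination protocol below. */
}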
Inside a workspace…
[Diagram as before: Scan block and Alloc block with their scan and allocation pointers.]
• When the pending set is empty:
– Make the Scan block = the Alloc block
– Scan until complete
– Look for more full blocks…
– Note that we may now have scanned part of the Alloc block, so we need to remember what portion has been scanned. (Full details of the algorithm are in the paper.)
Termination
• Keep a global counter of running threads.
• When a thread finds no work to do, it decrements the count of running threads.
• If it finds the count = 0, all the work is done: stop.
• Poll for work: if there is work to do, increment the counter and continue (don’t remove the work from the queue yet).
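A minimal sketch of that protocol using C11 atomics; gc_running would be initialised to the number of GC threads at the start of GC, and pending_set_nonempty() is a hypothetical non-destructive peek:

#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

static atomic_int gc_running;        /* set to N when GC starts */

bool pending_set_nonempty (void);    /* hypothetical: peek, don't take */

/* Called when a thread finds no work. Returns false when all
   threads are idle (GC is finished), true when new work appeared. */
bool wait_for_work (void)
{
    atomic_fetch_sub(&gc_running, 1);
    for (;;) {
        if (atomic_load(&gc_running) == 0)
            return false;                 /* everyone is idle: done */
        if (pending_set_nonempty()) {     /* don't remove the work yet */
            atomic_fetch_add(&gc_running, 1);
            return true;                  /* rejoin the main loop */
        }
        sched_yield();                    /* be polite while polling */
    }
}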
When work is scarce
• We found that often the pending set is small or empty (despite earlier measurements), leading to low parallelism.
• The only solution is to export smaller chunks of work to the pending set.
• We use a dynamic chunk size: when the pending set is low, we export smaller chunks.
• Smaller chunks lead to a fragmentation problem: we want to fill up the rest of the block later, so we have to keep track of these partially-full blocks in per-thread lists.
Forwarding pointers and synchronisation
[Diagram: when an object is copied into to-space, its header word is overwritten with a forwarding pointer (FWD) to the new copy; the payload is left in place.]
• Must synchronise if two threads attempt to copy the same object, otherwise the object is duplicated.
• Use CAS to install the forwarding pointer; if another thread installs the pointer first, return it (don’t copy the object). One CAS per object!
• If the object is immutable, then we don’t mind copying it twice, and in this case we could omit the CAS (but note that the forwarding pointer must not overwrite the payload).
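The copy-with-CAS step might look roughly like this; a sketch under an assumed object layout (the header word doubles as a tagged forwarding pointer), with hypothetical helpers is_forwarded, fwd_target, make_fwd, copy_object and retract_copy:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct obj { _Atomic uintptr_t header; /* payload follows */ } obj;

/* Hypothetical helpers over the assumed header encoding: */
bool      is_forwarded (uintptr_t h);   /* is h a forwarding pointer?    */
obj      *fwd_target   (uintptr_t h);   /* decode the forwarded copy     */
uintptr_t make_fwd     (obj *to);       /* encode a forwarding pointer   */
obj      *copy_object  (obj *from);     /* copy into our Alloc block     */
void      retract_copy (obj *to);       /* undo a copy we lost a race on */

obj *evacuate (obj *from)
{
    uintptr_t h = atomic_load(&from->header);
    if (is_forwarded(h))
        return fwd_target(h);           /* someone already copied it */

    obj *to = copy_object(from);
    /* One CAS per object: try to install the forwarding pointer. */
    if (atomic_compare_exchange_strong(&from->header, &h, make_fwd(to)))
        return to;                      /* we won the race */

    retract_copy(to);                   /* another thread won: undo our copy; */
    return fwd_target(h);               /* h now holds their forwarding ptr   */
}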
Overhead due to atomic forwarding
Parallel Generational-copying GC
• Fairly straightforward generalisation of parallel copying GC.
• Three complications:
– maintaining the remembered sets
– deciding which pending set to take work from
– tenuring policy
• See the paper for the details.
Speedup results
Measuring load balancing
• Ctot is the total data copied by all GC threads.
• Cmax is the maximum copied by any single thread.
• Work balance factor = Ctot / Cmax
• Perfect balance = N (the number of threads), perfect imbalance = 1. E.g. with N = 4, Ctot = 100MB and Cmax = 25MB, the balance factor is 100/25 = 4.
• The balance factor is the maximum possible speedup given the distribution of work across CPUs (the actual speedup might be lower for other reasons).
Load balancing results
Status
• Already checked in; you can try it out right now in GHC HEAD.
• Tested on a GHC build – saved about 6%.
• Will be available in GHC 6.10 (autumn 2008).
• Multi-threaded GC will usually be a win on 2+ cores, although it requires increasing the heap size to get the most benefit: parallelising small GCs doesn’t work so well.
A war story…
• This GC was first implemented by Roshan James, in the summer of 2006.
– measurements showed negative speedup
• Re-implemented by SimonM in 2007
– also achieved negative speedup
– despite having good load-balancing.
• The cause of the bottleneck:
– After copying an object, a pointer in the block descriptor was updated. Adjacent block descriptors sometimes share a cache line, so multiple threads were writing to the same cache line (false sharing) => Very Bad.
– It took multiple man-weeks and 3 profiling tools to find the problem.
– Solution: cache the pointer in thread-local storage.
Further work
• Investigate/improve load-balancing
• Avoid locking for immutable objects – done!
– Contention is very low.
– We might get a tiny amount of duplication per GC.
• Independent minor GCs.
– Hard to parallelise minor GC: too quick, not enough parallelism.
– Stopping the world for minor GC is a severe bottleneck in a program running on multiple CPUs.
– So do per-CPU independent minor GCs.
– Main technical problem: either track or prevent inter-minor-generation pointers (e.g. Doligez/Leroy (1993) for ML, Steensgaard (2001)).
• Concurrent marking, with simple sweep – partly done!
– Blocks with no live objects can be freed immediately; compact or copy occasionally to recover from fragmentation.
Optimisations…
• There is a long list of tweaks and optimisations that we tried; some helped, some didn’t.
– Move block descriptors to the beginning of the block: bad cache/TLB effects.
– Prefetching: no joy, too fragile, and recent CPUs do automatic prefetching.
– Should the pending block set be FIFO or LIFO? Or something else?
– Some objects don’t need to be scanned; copy them to a separate non-scanned area (not worthwhile).