Mostly Parallel Garbage Collection


Hans-J. Boehm
Alan J. Demers
Scott Shenker
Presented by Kit Cischke
Outline
 Introduction
 Basics of Garbage Collection revisited
 How do you make a GC for non-GC languages?
 Oh, and making it parallel would be nice.
 Or at least mostly parallel.
 The Basic Idea
 Virtual dirty bits to find the reachable set
 Sweeping doesn’t matter.
 Try telling that to your mother.
 For performance, the sweep step is practically ignorable.
 Formalisms
 Let’s introduce some notation and concepts for how this
should work.
Outline, Part Deux
 Implementation Choices
 Based on our formalisms, which is the best
combination to actually use?
 Brief Results
 Really, the test hardware was a SPARCstation
configured with as little as 10MB of RAM!
 Mostly Parallel Copying Collectors
 This is a mark-sweep paper, mostly. Could you build a
copy collector?
 Onwards!
GC Taxonomy and Our Choices
 Garbage Collectors may be Reference-Counting
or Tracing based.
 The authors focus on tracing out from the root set.
 The basic style of many early collectors was “stop-the-
world” collection.
 Generational and parallel collectors attempt to
mitigate the potentially long delays while the
world is stopped.
 Generational collectors just collect a small part of the
heap.
 Parallel collectors might be generational, but they
mainly try to collect the whole heap, but in parallel
with the mutator(s).
So why “mostly” parallel?
 Think back to the VM migration papers. Migration started
while the VM was running.
 Or in parallel.
 But at some point, the VM had to be stopped to complete
the transfer.
 Hopefully, by that time, there was very little to transfer.
 Same idea here: as much collection is done as possible
while the mutator is running. At some point, we need to
stop the world to finish the collection.
 For mostly the same kind of reason: after the collector has made
decisions about certain pointers, the mutator will do things that
make the objects they point to reachable or unreachable.
 This is meaningful because we don’t want the collector running
all the time.
Authors’ Two Stated Goals
 “Present a method for transforming a stop-the-world tracing collector into a mostly parallel
collector.”
 And to make the solution general to copying/non-
copying or generational/non-generational collectors.
 Furthermore, no OS changes are needed.
 “Describe a particular implementation of a
garbage collector that illustrates this idea.”
 What’s really cool is that it will provide GC to
languages like C with relatively short pause times.
Basic Idea
 Every program has a root set. The root set
forms the foundation for the immune set, or
the set of objects that are reachable or live.
 Tracing the path of pointers from the root set
finds live, reachable objects that are marked.
 Unmarked (and therefore unreachable) objects
can be collected.
More Basic Idea
 Key idea: Whenever a virtual memory page is
written to, set a virtual dirty bit for that page.
 At the beginning of a collection, clear all the dirty
bits. Start tracing.
 The tracing finds all currently reachable objects
while the mutator keeps doing its thing.
 Writes introduce dirty pages.
 When the original trace is done, stop the world
and trace out marked objects on dirty pages.
 Now everything reachable is marked.
 But is it safe to say everything unreachable is not
marked?
A Compromise
 No, the collector is neither purely parallel nor
precise.
 The duration of the stop-the-world pause is directly
dependent on the number of dirtied pages.
 In theory, things can be no worse than a whole-heap stop-
the-world collection. The authors claim this doesn’t
happen.
 Not all unreachable objects are collected, as they
may have been marked before the mutator
dismissed them.
 The collector is complete, in that eventually that memory
will be reclaimed. (Just not right now!)
Sweeping Doesn’t Matter
 Phase 2 of a mark-sweep collector is to free
the unused memory in whatever form that
takes – called sweeping.
 Sweeping doesn’t need to occur during the
world stoppage. Once we know what’s
garbage, we can sweep interleaved with
object allocation.
Sweeping Implemented Here
 The heap is split into blocks. Each block contains objects of
a certain size.
 For small objects, the block size is the same as a physical page of
memory.
 After marking, pages are queued for sweeping in one of
multiple queues (one per object size).
 Each object size also has a free list. When it is empty, the
allocator sweeps the front of the queue for that object size
and restores that memory to the free list.
 Blocks for larger objects are swept in large increments
immediately following a collection.
 This limits CPU time consumed by the collection.
 The net effect is that GC times are dominated by the
marking.
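The lazy sweeping scheme on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' allocator: the class name `SizeClass`, the representation of a block as a list of `(slot, marked)` pairs, and the slot names are all assumptions made for the example.

```python
from collections import deque

class SizeClass:
    """Hypothetical per-object-size state: one free list, one sweep queue."""

    def __init__(self):
        self.free_list = []         # slots ready to hand out
        self.sweep_queue = deque()  # blocks marked by the GC but not yet swept

    def queue_block(self, block):
        # Called after marking; block is a list of (slot, marked) pairs.
        self.sweep_queue.append(block)

    def allocate(self):
        # Sweep only when the free list is empty, one block at a time.
        while not self.free_list and self.sweep_queue:
            block = self.sweep_queue.popleft()
            self.free_list.extend(slot for slot, marked in block if not marked)
        if not self.free_list:
            return None  # a real allocator would collect or grow the heap here
        return self.free_list.pop()

sc = SizeClass()
sc.queue_block([("s0", True), ("s1", False)])
sc.queue_block([("s2", False)])
first = sc.allocate()
pending_after_first = len(sc.sweep_queue)
second = sc.allocate()
third = sc.allocate()
```

The sweep cost is paid a block at a time by the allocating thread, which is why none of it has to happen while the world is stopped.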
Let’s Get Formal
 Definition: A partial collection only reclaims
some subset of the unreachable objects.
 Let the set T contain all threatened objects (that
is, objects that might be collected).
 Let the set I contain all immune objects (that is,
objects that will not be collected).
 T and I are disjoint. All objects fall into either T or I.
 For a full collection, I contains only the roots. In
a partial collection, there are additional objects.
 A collection is correct iff no reachable objects are
collected.
Guaranteeing Correctness
 Reclaim only unmarked objects when the
following condition is true:
 C: Every object in I is marked and every
object pointed to by a marked object is also
marked.
Stop-the-World Collection
Formalizing stop-the-world collection:
Step 1: Stop the world
Step 2: Clear all mark bits
Step 3: Perform the tracing operation TR.
Step 4: Restart the world
 The operation TR:
TR: Mark all objects in I and trace from them.
 At the end of this 4-step operation, condition C
holds, and all unmarked objects can be collected.
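As a rough sketch of the four steps above, assuming a toy heap modeled as a dictionary from object name to the list of objects it points to (the names `tr`, `stop_the_world_collect`, and the sample objects are illustrative, not from the paper):

```python
def tr(heap, immune):
    """TR: mark all objects in I and trace from them."""
    marked = set(immune)
    stack = list(immune)
    while stack:
        obj = stack.pop()
        for target in heap[obj]:
            if target not in marked:
                marked.add(target)
                stack.append(target)
    return marked

def stop_the_world_collect(heap, immune):
    # Step 1: stop the world (nothing to do in this single-threaded sketch).
    # Step 2: clear all mark bits (tr starts from an empty mark set).
    # Step 3: perform TR.
    marked = tr(heap, immune)
    # Step 4: restart the world; unmarked objects may be reclaimed.
    return set(heap) - marked

heap = {"root": ["a"], "a": [], "old": ["x"], "x": [], "y": []}
full = stop_the_world_collect(heap, {"root"})            # I = roots only
partial = stop_the_world_collect(heap, {"root", "old"})  # I has extra objects
```

In the partial run, `old` and `x` survive even though neither is reachable from `root`, which is what makes the collection partial rather than incorrect.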
Parallel Collection
Formally, mostly parallel collection requires:
Step 1: Clear all mark bits
Step 2: Clear all virtual dirty bits
Step 3: Perform the tracing op TR.
Step 4: Stop the world
Step 5: Perform a finishing operation, F
Step 6: Restart the world.
 The Finishing Operation
F: Trace from all marked objects on dirty pages.
Notes on that Collection
 TR is performed totally in parallel with the mutator,
which is dirtying pages that will need to be traced.
 The closure condition C does not hold after step 4
(stop-the-world), which is what requires the finishing
step F.
 We will define a weaker closure C’:
C’: Every object in I is marked and every object
pointed to by a marked object on a clean page is
also marked.
 Applying F to any state satisfying C’ will produce C.
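To make the claim about C’ and F concrete, here is a small sketch. It assumes a toy heap (object name to pointer list), a hypothetical `page_of` map from object to page number, and a hand-built state that stands in for what a concurrent TR might leave behind: the mutator moved the only pointer to `c` from `b` into `a` after `a` had been scanned, dirtying pages 1 and 2.

```python
def holds_c(heap, marked, immune):
    """C: I is marked and the marked set is closed under pointers."""
    return immune <= marked and all(
        t in marked for o in marked for t in heap[o])

def holds_c_prime(heap, page_of, marked, dirty, immune):
    """C': as C, but only marked objects on clean pages are checked."""
    return immune <= marked and all(
        t in marked
        for o in marked if page_of[o] not in dirty
        for t in heap[o])

def finish(heap, page_of, marked, dirty):
    """F: trace from all marked objects on dirty pages."""
    stack = [o for o in marked if page_of[o] in dirty]
    while stack:
        obj = stack.pop()
        for t in heap[obj]:
            if t not in marked:
                marked.add(t)
                stack.append(t)

# Assumed state at step 4 (world just stopped).
heap = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
page_of = {"root": 0, "a": 1, "b": 2, "c": 3}
marked = {"root", "a", "b"}
dirty = {1, 2}
immune = {"root"}

c_before = holds_c(heap, marked, immune)
c_prime_before = holds_c_prime(heap, page_of, marked, dirty, immune)
finish(heap, page_of, marked, dirty)
c_after = holds_c(heap, marked, immune)
```

Here `c` is reachable but unmarked when the world stops, so C fails while C’ holds; after F, `c` is marked and C holds.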
Considerations
 Thus we have a correct, mostly-parallel
collection.
 But, if we have a busy mutator, we might have
lots of dirty pages, which in turn implies long
pauses during the world stoppage.
 To shorten this delay, we can clean the pages in
parallel.
 Let P be a set of pages. Then the process M is:
M: 1.) Atomically retrieve and clear the virtual
dirty bits from P. 2.) Trace from the marked
objects on the dirty pages of P.
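A possible shape for M, sketched under assumptions: the class `VirtualDirtyBits` is hypothetical, and a lock stands in for whatever mechanism makes the retrieve-and-clear step atomic in a real system. The heap and page layout are the same kind of toy model as before.

```python
import threading

class VirtualDirtyBits:
    """Hypothetical dirty-bit table; the lock models atomicity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dirty = set()

    def mark_dirty(self, page):
        with self._lock:
            self._dirty.add(page)

    def retrieve_and_clear(self, pages):
        # M step 1: read and clear the bits for P in one atomic action.
        with self._lock:
            taken = self._dirty & pages
            self._dirty -= taken
            return taken

def process_m(heap, page_of, marked, bits, pages):
    taken = bits.retrieve_and_clear(pages)
    # M step 2: trace from marked objects on the pages that were dirty.
    stack = [obj for obj in marked if page_of[obj] in taken]
    while stack:
        obj = stack.pop()
        for target in heap[obj]:
            if target not in marked:
                marked.add(target)
                stack.append(target)
    return taken

heap = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
page_of = {"root": 0, "a": 1, "b": 2, "c": 3}
all_pages = set(page_of.values())
marked = {"root", "a", "b"}
bits = VirtualDirtyBits()
bits.mark_dirty(1)
bits.mark_dirty(2)

cleaned = process_m(heap, page_of, marked, bits, all_pages)
left_for_f = bits.retrieve_and_clear(all_pages)
```

Because the mutator keeps running during M, pages it writes afterwards are dirtied again and are picked up by the next M or by F; in this example nothing is written in between, so nothing is left for F.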
Generational Partial Collection
 All of that formally describes a general partial
collection.
 Now let’s consider a generational collector
that uses the mark bits for object age.
 Consider a partial collector where I is chosen
to be the set of currently marked objects.
 Therefore, C’ holds.
 We could be done by simply performing F,
but to reduce the delay, we perform M to the
entire heap just before the world stoppage.
Formal Parallel Generational Collection
1. Perform M on the heap.
2. Stop the world.
3. Perform F.
4. Restart the world.
 Because an object that has been marked will
never be collected by the generational
collector, we occasionally need to run a full
collection.
An Alternate Version of M
 M’ could be:
M’: 1.) Atomically retrieve and clear the dirty
bits from the pages P, and 2.) for all
unmarked objects pointed to by marked
objects on dirty pages of P, mark them and
dirty the pages on which they reside.
 Iteratively performing M’ can substitute for
M, though M is generally preferable.
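One way to picture M’ is the sketch below, which marks only one pointer hop per call and pushes the remaining work onto newly dirtied pages. The function name, the toy heap, and the plain `set` used for dirty bits are assumptions; with no mutator running, repeating M’ until no page is dirty reaches the same marked set M would.

```python
def process_m_prime(heap, page_of, marked, dirty, pages):
    # Step 1: retrieve and clear the dirty bits for P
    # (atomicity is not modeled in this single-threaded sketch).
    taken = dirty & pages
    dirty.difference_update(taken)
    # Step 2: mark direct targets only, and dirty the pages they live on.
    for obj in [o for o in marked if page_of[o] in taken]:
        for target in heap[obj]:
            if target not in marked:
                marked.add(target)
                dirty.add(page_of[target])
    return taken

heap = {"root": ["a"], "a": ["c"], "c": ["d"], "d": []}
page_of = {"root": 0, "a": 1, "c": 2, "d": 3}
all_pages = set(page_of.values())
marked = {"root", "a"}
dirty = {1}

rounds = 0
while dirty:
    process_m_prime(heap, page_of, marked, dirty, all_pages)
    rounds += 1
```

The chain `a -> c -> d` takes three rounds here, where a single M would have traced it in one pass.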
Implementation Choices
 When and how to use M and M’.
 No M’.
 For allocation-intensive mutators, run M more than once (twice
seems to be the sweet spot).
 What is a “full collection” going to be, and when to run it?
 Initially triggered on heap exhaustion.
 However, the allocating thread would be stalled, even with the
parallel collector.
 Settled on a daemon thread that kicks off the collector if the
amount of used memory exceeds some threshold above what
was being used at the end of the last collection.
 Then we run up to two iterations of M, then a concurrent
execution of TR.
 If we run out of memory, we try to expand the heap.
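The daemon's trigger test might look something like this. The function name and the byte figures are made up for illustration; the slide does not give the actual threshold.

```python
def should_start_collection(used_now, used_after_last_gc, threshold):
    """True once memory use has grown past the threshold since the last GC."""
    return used_now - used_after_last_gc > threshold

MB = 1024 * 1024
# Hypothetical numbers: 10 MB live after the last collection, 4 MB threshold.
start_at_12mb = should_start_collection(12 * MB, 10 * MB, 4 * MB)
start_at_15mb = should_start_collection(15 * MB, 10 * MB, 4 * MB)
```

Starting the collection before the heap is exhausted is what keeps the allocating thread from stalling while the collector runs.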
Brief Results
 This collector was used at Xerox PARC for quite a
while, heavily optimized.
 They didn’t modify the SunOS running on their
machines, but just write-protected the heap.
 Mainly interested in measuring interactive
response.
 Subjectively better. (But they are aware this is pretty
fuzzy.)
 Ran 5 iterations of a “Boyer benchmark” and an
allocator loop at various memory configurations,
trying to even the playing field for full,
generational and parallel generational collectors.
Results
Mostly Parallel Copying Collectors
 We can do all the same things and make a
copying collector, if we want.
 It just requires space to maintain explicit
forwarding links.
 A forward pointer is associated with each object,
used only by the GC.
 Reachable objects are copied from from-space to
to-space, writing the new address into the
forward pointer in from-space.
 The mutator only sees the from-space pointers.
More on Copy Collectors
 Concurrent collection forces the following to be
true:
 If an object residing on a clean page has been copied,
then everything it points to has also been copied.
 If an object resides on a clean page, its current
contents are up-to-date.
 With the world stopped, we can execute the
finishing operation shown on the next slide such
that all reachable objects are found with correct
contents in to-space.
Copying Finishing Op
 Fc: For every object a whose from-space copy resides on a
dirty page:
 1. Copy everything it points to that hasn’t already been copied.
 2. Update pointers to point to to-space.
 3. Recopy a to reflect changes to both pointer and non-pointer fields that
occurred since the collection started.
 Could create a concurrent version of Fc, but the authors
found a copy collector to be impractical for their
environment and didn’t bother implementing one.
 Just like with the mark-sweep collector, the world-stoppage
time is proportional to the number of dirtied pages.
Questions?
 I really liked the tone of this paper. It had less
of that stuffy, self-important academic tone.