
Parallel, Incremental, and Mostly Concurrent GC
Yoav Ossia
IBM Haifa Research Laboratory
November, 2002
Motivation
Modern SMP servers introduce
Higher level of true parallelism
Multi-gigabyte heaps
Multi-threaded applications which must ensure fast response time
New demands from GC
Short pause time on large heaps
Minimal throughput hit
Scalability on multi-processor hardware
Efficient algorithms for weakly ordered hardware
 We will not talk about this...
Workarounds, which do not work...
Bigger heaps
Object pooling
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Goals of this lecture
Present the ideas and algorithms
As done in the IBM JVM
Introduce the concerns of implementation
The devil's in the details
What's cheap and what's expensive
How this gets into the design
Concurrent coding, the real story
Difficulties, and how to avoid them
Mark Sweep Compact GC (MSC)
Mark - traces all reachable (live) objects in heap
Needed data structures and operations:
 Mark Stack (push & pop of references to objects)
 SetMarked (Object *obj) & boolean IsMarked (Object *obj)
Start from all roots (threads' stacks, system classes, etc.)
 Mark, and Push to mark stack
Process the mark stack, till empty
 Pop Obj, and trace (mark and push all unmarked references from Obj)
Sweep
Coalesce unmarked objects into free chunks
Create a list of free chunks
Compact (the disaster zone)
Usually done when allocation request cannot be satisfied
Move and coalesce the live objects, to create bigger free chunks
Usually very long, unattractive
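A minimal sketch in C of the mark phase just described; the Object layout, stack capacity, and helper names are illustrative assumptions, not the IBM JVM's actual structures.

    #include <stddef.h>

    typedef struct Object Object;
    struct Object {
        size_t   num_refs;  /* number of reference fields in this object */
        Object **refs;      /* the reference fields themselves */
    };

    /* Hypothetical mark-bit helpers; a real collector uses a bit-vector. */
    extern int  IsMarked(Object *obj);
    extern void SetMarked(Object *obj);

    #define STACK_CAP 4096
    static Object *mark_stack[STACK_CAP];
    static size_t  top;

    static void mark_and_push(Object *obj)
    {
        if (obj != NULL && !IsMarked(obj)) {
            SetMarked(obj);
            mark_stack[top++] = obj;   /* overflow handling omitted */
        }
    }

    /* Mark: start from all roots, then process the mark stack till empty. */
    void mark_phase(Object **roots, size_t num_roots)
    {
        for (size_t i = 0; i < num_roots; i++)
            mark_and_push(roots[i]);
        while (top > 0) {
            Object *obj = mark_stack[--top];          /* pop */
            for (size_t i = 0; i < obj->num_refs; i++)
                mark_and_push(obj->refs[i]);          /* trace unmarked children */
        }
    }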
The MSC Process
[Diagram: heap state through Mark, Sweep, and Compact(?), starting from the roots (stacks, globals); regions labeled Used, Marked, Free, Dark matter, Unmovable]
MSC Subtle Issues
Mark
Proportional to amount of live objects only
Risk of mark stack overflow (e.g., linked list with Next as last field)
Avoid paging and cache misses
Sweep
Walk the whole heap again
Naive method is proportional to amount of live AND dead objects
Partially solved by mark bit-vector
 Bit for each basic heap unit (typically 8 bytes); object mapped to a specific bit
 Walk the mark bit-vector, inspect heap only when big holes are found in the vector
Compact
Tradeoff: level of compaction vs. speed
If not type-accurate GC, not all objects can be moved
 Can't tell if a slot on stack is a reference, or a numeric value
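Returning to the sweep above, a hedged sketch of such a bit-vector walk, assuming one mark bit per 8-byte unit; the FreeChunk layout and the min_units threshold (the "dark matter" cutoff) are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    #define UNIT 8  /* basic heap unit: one mark bit per 8 bytes */

    typedef struct FreeChunk { struct FreeChunk *next; size_t size; } FreeChunk;

    /* Walk the mark bit-vector; touch the heap itself only where a long
     * run of zero bits (a big hole) is found. */
    void sweep(uint8_t *heap, size_t heap_units, const uint64_t *markbits,
               FreeChunk **free_list, size_t min_units)
    {
        size_t i = 0;
        while (i < heap_units) {
            /* skip marked (live) units */
            while (i < heap_units && ((markbits[i / 64] >> (i % 64)) & 1))
                i++;
            size_t start = i;
            /* measure the run of unmarked units */
            while (i < heap_units && !((markbits[i / 64] >> (i % 64)) & 1))
                i++;
            size_t run = i - start;
            if (run >= min_units) {        /* small holes stay as dark matter */
                FreeChunk *c = (FreeChunk *)(heap + start * UNIT);
                c->size = run * UNIT;      /* coalesced free chunk */
                c->next = *free_list;
                *free_list = c;            /* add to the free list */
            }
        }
    }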
Parallel STW MSC on N-Way
Usually GC thread per processor
Parallel mark
Synchronized marking
Load balancing needed (overflow, starvation)
 Separate mark stacks for each thread
 Stealing from designated "private" areas, attached to mark stacks (Endo et al)
 Direct access to mark stacks of other threads (Flood et al)
 Adding a single shared stack (Cheng & Blelloch)
Parallel sweep
Heap divided into M areas, M > f * N
Synchronization needed for area selection, and free list concatenation
Compact
Tradeoff: parallelism vs. auxiliary data structures
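For the synchronized marking above, a sketch of claiming a mark bit with compare-and-swap, using C11 atomics over the illustrative bit-vector layout from the earlier sweep sketch. The return value lets parallel markers push each object exactly once.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns true iff the calling thread set the bit, i.e. it now "owns"
     * the object and should push it; false means another thread won. */
    bool try_set_marked(_Atomic uint64_t *markbits, size_t bit)
    {
        _Atomic uint64_t *word = &markbits[bit / 64];
        uint64_t mask = (uint64_t)1 << (bit % 64);
        uint64_t old  = atomic_load_explicit(word, memory_order_relaxed);
        while (!(old & mask)) {
            if (atomic_compare_exchange_weak_explicit(
                    word, &old, old | mask,
                    memory_order_acq_rel, memory_order_relaxed))
                return true;   /* we marked it: push for tracing */
        }
        return false;          /* already marked by another thread */
    }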
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
The Concurrent Collection Principle
STW pauses may reach seconds
Cost of mark phase dominant
Mark may be done while the program is active, except that the object graph changes...
Correctness kept by use of a write barrier
Activated at each change of a reference field in an object (gray it!)
Functionality is algorithm-dependent, sometimes expensive
The Concurrent Collection
What, Who, How
Mostly concurrent MSC
Tracing done while mutator threads are active (Boehm et al., Printezis & Detlefs)
 Retrace (clean) may be done while mutator threads are active
Short final STW
 Last clean and resulting tracing, and Sweep
Originally done by separate thread/processor
"Real" concurrent
Incremental work
Tracing done incrementally by the mutator threads
First done on a copying collector (Baker)
Parallel execution
Concurrent phase is also parallel
 Many threads can do concurrent work simultaneously
The IBM Mostly Concurrent GC
Putting together all existing elements
First production-level parallel incremental mostly concurrent MSC collector
Combining incremental and concurrent tracing
Efficient concurrent marking that terminates on time
New mechanism for parallel load balancing
Especially fit for dynamic number of participating threads
When compared to mature industry-quality GC
Drastic reduction in pause time (more than 75%)
Small throughput hit (~10%)
Phases of The Collector
Concurrent phase
Tracing of all reachable objects
 Done incrementally by Java mutators and dedicated low-priority tracing threads
Write barrier records changes per region (card) in a card table
 Any change of reference in an object dirties the card
 All black objects in the card are changed to gray
 Fast and relatively cheap operation (2% - 6% throughput hit)
A single card cleaning pass
 In each dirty card, retrace all marked objects
 Cleaning may precede the actual tracing
Final STW phase
Root scanning and final card cleaning pass
Tracing of all additional objects
Parallel sweep
 Now replaced by concurrent sweep
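A sketch of the card cleaning pass described above, assuming a byte-per-card table; retrace_marked_objects_in is a hypothetical helper standing in for the retrace of marked objects in a card:

    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t card_table[];  /* one byte per 512-byte card; nonzero = dirty */
    extern size_t  num_cards;

    /* Hypothetical helper: re-scan the already-marked (black) objects mapped
     * to this card, graying any unmarked references found in them. */
    extern void retrace_marked_objects_in(size_t card);

    void card_cleaning_pass(void)
    {
        for (size_t c = 0; c < num_cards; c++) {
            if (card_table[c]) {
                card_table[c] = 0;            /* clear first: a racing mutator
                                                 store will simply re-dirty it */
                retrace_marked_objects_in(c);
            }
        }
    }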
CPU Distribution
[Chart]
Write Barrier
Activated by the JVM, on each reference change done in Java
Writes in a card table
 Each card covers 512 bytes of heap
Cleaning (concurrent or final) may happen anytime
Foo.a = O1
 Store O1 in a root (guaranteed to be reachable)
 Set Foo.a to O1
 Activate Write Barrier on Foo
 Dirty the entry of Foo in the card table
 Remove O1 from root
Object may span many cards
 Usually mapped to the card where its header starts
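Putting the slide's Foo.a = O1 sequence into a minimal C sketch, assuming 512-byte cards and a byte-per-card table; heap_base and card_table are hypothetical globals:

    #include <stddef.h>
    #include <stdint.h>

    #define CARD_SHIFT 9  /* 512-byte cards: 2^9 */

    extern uint8_t *heap_base;    /* start of the object heap */
    extern uint8_t  card_table[];

    /* Foo.a = O1, with the barrier applied to Foo (the object written into).
     * The object is mapped to the card where its header starts. */
    static inline void write_barrier(void *obj, void **field, void *new_ref)
    {
        *field = new_ref;                                            /* Foo.a = O1 */
        card_table[((uint8_t *)obj - heap_base) >> CARD_SHIFT] = 1;  /* dirty card */
    }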
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
The Problem of Punctual Termination


Traditional STW collection starts when heap is full
 Produces minimal number of GCs
Mostly concurrent aims at completing the concurrent marking when heap becomes full
 If heap gets filled before CM terminates, rest of marking is moved to the final phase
  Longer pause
 If CM terminates before heap is filled, choose the lesser of two evils:
Wait for heap to be filled (and accumulate more dirty cards)
Initiate an "early" GC (more GCs, with all their additional costs)

Concurrent marking should be adaptive
Combining Concurrent and Incremental
Existing approaches
Incremental
 Mutators do tracing proportional to allocation (Tracing Rate)
 Tracing guaranteed to terminate (more or less) on time
 Decreases application performance
Specialized GC threads (concurrent)
 More efficient, better CPU utilization
 Tracing rate determined by ratio between GC and program threads
Usually not changed by the collector
 No control of termination point
Hybrid usage of both approaches
Low priority background threads fully utilize CPU idle time
 Not controlled by the tracing rate
Mutators perform incremental tracing, to ensure proper termination
 Only if tracing goals not met by background threads
 Control "Milestones" (concurrent start, card cleaning pass start, etc.)
Metering Formulas
Kickoff point of concurrent marking
User-specified Tracing Rate (TR)
Live objects estimation (Lest), dirty objects estimation (Mest)
Start concurrent when free space gets below (Lest + Mest) / TR
 So the tracing rate, applied to remaining allocations, matches the tracing work
Calculating the work
Amount of concurrently traced objects (Traced)
Amount of remaining free memory (Free)
Estimated background threads tracing rate (Best)
 Ratio between total amounts of background tracing and allocations
Dynamically recalculate actual tracing rate
ATR = (Lest + Mest - Traced) / Free
 Ratio between remaining work and free space
Account for work done in background: ATR2 = ATR - Best
 Trace only if background tracing lags
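The same metering, written out as straight-line C; names follow the slide (TR, Lest, Mest, Traced, Free, Best), while the per-allocation hook and its units are assumptions:

    /* Kickoff: start concurrent marking once free space drops below the
     * estimated work remaining divided by the user-specified tracing rate. */
    int should_kickoff(double Lest, double Mest, double Free, double TR)
    {
        return Free < (Lest + Mest) / TR;
    }

    /* Per-allocation metering: how much the allocating mutator must trace,
     * given the work already done and the background tracing rate. */
    double tracing_quota(double Lest, double Mest, double Traced,
                         double Free, double Best, double alloc_bytes)
    {
        double ATR  = (Lest + Mest - Traced) / Free;  /* remaining work / free space */
        double ATR2 = ATR - Best;                     /* credit background tracing */
        return ATR2 > 0 ? ATR2 * alloc_bytes : 0;     /* trace only if background lags */
    }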
Behavior Patterns
[Charts: CPU usage vs. time, divided between parallel STW, Java mutation, incremental tracing, and background tracing:
 STW MSC GC - throughput 100%
 Concurrent, tracing rate 3 - throughput 80%
 Concurrent, tracing rate 8 - throughput 90%
 Concurrent, tracing rate 8, CPU 80% - throughput 95%
 Concurrent, tracing rate 8, CPU 50% - throughput 110%]
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Load Balancing - the Problem
Even distribution of objects between parallel tracing threads
Avoid mark stack overflow and/or starvation of threads
Suitable for an unknown number of collection threads
Efficient synchronization
Supply simple termination detection
Existing approaches
All use separate mark stacks for each thread
Stealing from designated "private" areas, attached to mark stacks (Endo et al)
Direct access to mark stacks of other threads (Flood et al)
Adding a single shared stack (Cheng & Blelloch)
Load Balancing for Concurrent
Pools of WorkPackets
Each is a smaller mark stack
Cheap Get/Put synchronization (compare & swap)
Separate pools for different occupancy: Full, Non-Full, Non-Empty, Empty
Each pool maintains a counter
Tracing thread uses 2 WorkPackets
Objects are popped from the input WP
Newly marked objects are pushed to the output WP
Empty input WP is returned to the "Empty" pool
 New "as-full-as-possible" WP is then fetched
Full output WP is returned to the "Full" pool
 New "as-empty-as-possible" WP is then fetched
Different object graph traversal
BFS, limited by the capacity of a WorkPacket
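A hedged sketch of WorkPacket pools with cheap compare-and-swap Get/Put, using C11 atomics; sizes, names, and the lock-free stack representation are illustrative, and real code would also need to guard against ABA:

    #include <stdatomic.h>
    #include <stddef.h>

    #define WP_CAP 512  /* each WorkPacket is a small fixed-size mark stack */

    typedef struct Object Object;

    typedef struct WorkPacket {
        struct WorkPacket *next;   /* link inside a pool */
        size_t  top;
        Object *slots[WP_CAP];
    } WorkPacket;

    /* A pool is a lock-free stack of packets plus an occupancy counter. */
    typedef struct {
        _Atomic(WorkPacket *) head;
        atomic_size_t count;
    } Pool;

    void pool_put(Pool *p, WorkPacket *wp)     /* cheap Put: one CAS loop */
    {
        WorkPacket *h = atomic_load(&p->head);
        do { wp->next = h; }
        while (!atomic_compare_exchange_weak(&p->head, &h, wp));
        atomic_fetch_add(&p->count, 1);
    }

    WorkPacket *pool_get(Pool *p)              /* cheap Get: one CAS loop */
    {
        WorkPacket *h = atomic_load(&p->head);
        while (h && !atomic_compare_exchange_weak(&p->head, &h, h->next))
            ;                                  /* h is reloaded on failure */
        if (h) atomic_fetch_sub(&p->count, 1);
        return h;                              /* NULL if the pool is empty */
    }

A tracing thread would then pop from its input packet and push newly marked objects to its output packet, exchanging packets with the pools as they empty or fill.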
Advantages of WorkPackets
Fair competition when input is scarce
All threads get the same chance for tracing input
Simple detection of tracing state
Overflow - all packets are full
 Scalability is possible; simply allocate more WPs
Starvation - only empty WPs available, but not all WPs in the "Empty" list
Termination - all WPs in the "Empty" list
Positive results measured
Low cost of synchronization
Fair distribution of work among threads
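Given per-pool counters like those in the previous sketch, the three states reduce to counter checks (TOTAL_WPS, the number of allocated packets, is hypothetical):

    extern Pool full_pool, non_full_pool, non_empty_pool, empty_pool;
    #define TOTAL_WPS 256  /* hypothetical total number of allocated packets */

    int wp_overflow(void)   { return atomic_load(&full_pool.count)  == TOTAL_WPS; }
    int wp_terminated(void) { return atomic_load(&empty_pool.count) == TOTAL_WPS; }
    int wp_starvation(void) /* nothing left to fetch, yet packets are still in use */
    {
        return atomic_load(&non_empty_pool.count) == 0 &&
               atomic_load(&full_pool.count) == 0 &&
               atomic_load(&empty_pool.count) < TOTAL_WPS;
    }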
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Concurrent Code Maintenance
Extremely difficult to verify:
Races between concurrent tracing and program
Races between concurrent tracers
Timing is a major factor
 Debug version cannot reproduce Release bugs
 Problems surface only occasionally
 Behavior is machine dependent
About 40% of the code is verification code
Sanity checks
 Asserts, consistency checks
Logging of collection activity, state, and history
 Shadow heap, for tracing history
 Shadow card table, for card state and treatment
 Code to use the above for printing detailed information.
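As one illustration of the shadow structures mentioned above, a hypothetical shadow card table that keeps a small per-card event history for a failing check to print; none of this is the IBM JVM's actual logging format:

    #include <stddef.h>
    #include <stdint.h>

    /* Shadow card table: a history byte per card, updated alongside the real
     * card table, so a failure can report how each card got into its state. */
    enum card_event { CARD_DIRTIED = 1, CARD_CLEANED = 2, CARD_FINAL_CLEAN = 3 };

    extern uint8_t shadow_card_table[];
    extern void gc_report_failure(const char *expr, size_t card, uint8_t history);

    static inline void log_card_event(size_t card, enum card_event ev)
    {
        /* keep the last four events, two bits each */
        shadow_card_table[card] = (uint8_t)((shadow_card_table[card] << 2) | ev);
    }

    #define GC_ASSERT(cond, card)                                              \
        do {                                                                   \
            if (!(cond))                                                       \
                gc_report_failure(#cond, (card), shadow_card_table[(card)]);   \
        } while (0)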
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Comparison with STW GC
Compared to STW MSC
Using IBM's production-level JVM
4-way machines
NT, AIX and IA64
Mostly testing SPECjbb
Server-side Java
Throughput driven
60% live objects
Pause time cut by 75%
Mark time by 86%.
Sweep becomes dominant
Throughput hit of 10%
Comparison with STW GC (cont.)
Also testing pBOB
IBM internal benchmark
Fit for 2.5 GB heap, with low CPU utilization
Many threads
Effects of Various Tracing Rates
Mutator utilization - amount of Java mutation done during the concurrent phase
 Also controls the size of the "per-thread mini-STW"
Effects of Various Tracing Rates
Floating garbage - marked objects that become unreachable before the final STW phase
Amount of cards cleaned
References
A parallel, incremental and concurrent GC for servers.
 Ossia, Ben-Yitzhak, Goft, Kolodner, Leikehman, Owshanko. PLDI '02.
Mostly parallel garbage collection.
 Boehm, Demers, Shenker. ACM SIGPLAN Notices, 1991.
On-the-fly garbage collection: An exercise in cooperation.
 Dijkstra, Lamport, Martin, Scholten, Steffens. Comm. ACM, 1978.
A generational mostly-concurrent garbage collector.
 Printezis, Detlefs. ISMM 2000.
And many more...
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Concurrent Sweep
Sweep became the dominant part of the remaining pause time
Except for the needed initial allocation, rest of sweep can be deferred
 Concurrent sweep done incrementally
After the final phase, and before the next concurrent collection
Work done on each allocation request
No additional performance cost
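A sketch of deferring sweep to allocation requests, as described above; the quantum, cursor, and helper names are assumptions:

    #include <stddef.h>

    #define SWEEP_QUANTUM (64 * 1024)  /* heap bytes swept per allocation request */

    extern size_t next_unswept;        /* heap cursor left by the final STW phase */
    extern size_t heap_size;

    /* Hypothetical helpers, in the spirit of the earlier sweep sketch. */
    extern void  sweep_range(size_t start, size_t len);  /* bit-vector walk */
    extern void *free_list_alloc(size_t size);

    /* Allocation-time hook: do a bounded slice of sweeping, then allocate. */
    void *allocate(size_t size)
    {
        if (next_unswept < heap_size) {
            size_t len = heap_size - next_unswept;
            if (len > SWEEP_QUANTUM) len = SWEEP_QUANTUM;
            sweep_range(next_unswept, len);
            next_unswept += len;
        }
        return free_list_alloc(size);  /* may trigger GC on failure (not shown) */
    }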
Improving the Low Tracing Rate
Low tracing rate is more application-friendly
More CPU left to the program
Shorter tracing periods forced on threads
But throughput is reduced
Goal: Improve throughput with minimal hit on pause times
Achieved by reducing dirty cards and floating garbage
Better performance
Reduced heap residency
End