PowerPoint Presentation - Tel Aviv University
Parallel, Incremental, and
Mostly Concurrent GC
Yoav Ossia
IBM Haifa Research Laboratory
November, 2002
Motivation
Modern SMP servers introduce
Higher level of true parallelism
Multi-gigabyte heaps
Multi-threaded applications which must ensure fast response time
New demands from GC
Short pause time on large heaps
Minimal throughput hit
Scaleability on multi-processor hardware
Efficient algorithms for weak ordering hardware
We will not talk about this....
Workarounds, which do not work...
Bigger heaps
Object pooling
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Goals of this lecture
Present the ideas and algorithms
As done in the IBM JVM
Introduce the concerns of implementation
The devil's in the details
What's cheap and what's expensive
How does this get into the design
Concurrent coding, the real story
Difficulties, and how to avoid them
Mark Sweep Compact GC (MSC)
Mark - traces all reachable (live) objects in heap
Needed data structures and operations:
Mark Stack (push & pop of references to objects)
SetMarked (Object *obj) & boolean IsMarked (Object *obj)
Start from all roots (Threads' stacks, system classes, etc)
Mark, and Push to mark stack
Process the mark stack, till empty
Pop Obj, and trace (mark and push all unmarked references from Obj)
Sweep
Coalesce unmarked objects into free chunks
Create a list of free chunks
Compact (the disaster zone)
Usually done when allocation request cannot be satisfied
Move and coalesce the live objects, to create bigger free chunks
Usually very long, unattractive
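The mark phase above can be sketched as a small single-threaded program. This is an illustrative model only: `Obj`, its `refs` list, and the `HashSet` standing in for SetMarked/IsMarked are assumptions, not the IBM JVM's actual structures.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MarkPhase {
    static class Obj {
        final List<Obj> refs;                    // outgoing reference fields
        Obj(Obj... r) { refs = List.of(r); }
    }

    // Mark everything reachable from the roots; returns the marked set.
    static Set<Obj> mark(List<Obj> roots) {
        Set<Obj> marked = new HashSet<>();       // stands in for SetMarked/IsMarked
        Deque<Obj> markStack = new ArrayDeque<>();
        for (Obj root : roots) {                 // start from all roots
            if (marked.add(root)) markStack.push(root);
        }
        while (!markStack.isEmpty()) {           // process the mark stack until empty
            Obj obj = markStack.pop();
            for (Obj ref : obj.refs) {           // trace: mark and push unmarked refs
                if (marked.add(ref)) markStack.push(ref);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Obj c = new Obj();
        Obj b = new Obj(c);
        Obj a = new Obj(b);
        Obj dead = new Obj();                    // unreachable, so never marked
        Set<Obj> live = mark(List.of(a));
        System.out.println(live.size() + " live, dead marked: " + live.contains(dead));
    }
}
```

Sweep would then reclaim everything not in the marked set; a real, bounded mark stack additionally needs overflow handling.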
The MSC Process
[Diagram: heap evolving through Mark, Sweep, and Compact(?), starting from roots in thread Stacks and Globals; regions labeled Used, Marked, Free, Dark matter, and Unmovable]
MSC Subtle Issues
Mark
Proportional to amount of live objects only
Risk of mark stack overflow (e.g., linked list with Next as last field)
Avoid paging and cache misses
Sweep
Walk all the heap again
Naive method is proportional to amount of live AND dead objects
Partially solved by mark bit-vector
Bit for each basic heap unit (typically 8 bytes)
Object mapped to a specific bit
Walk the mark bit-vector, inspect heap only when found big holes in vector
Compact
Tradeoff: level of compaction vs. speed
If not type-accurate GC, not all objects can be moved
Can't tell if a slot on stack is a reference, or a numeric value
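The sweep-by-bit-vector idea can be sketched as follows. The 8-byte unit matches the slide; the 32-byte minimum chunk size and the {offset, length} free-chunk representation are invented for illustration (smaller holes stay behind as dark matter).

```java
import java.util.ArrayList;
import java.util.List;

public class BitVectorSweep {
    static final int UNIT = 8;               // bytes of heap per mark bit
    static final int MIN_CHUNK = 32;         // assumed threshold; smaller holes are skipped

    // Walk the mark bit-vector and return {offset, length} pairs (in bytes)
    // for each hole (run of clear bits) big enough to become a free chunk.
    static List<int[]> sweep(boolean[] markBits) {
        List<int[]> free = new ArrayList<>();
        int runStart = -1;
        for (int i = 0; i <= markBits.length; i++) {
            boolean marked = (i < markBits.length) && markBits[i];
            if (!marked && i < markBits.length) {
                if (runStart < 0) runStart = i;          // hole begins
            } else if (runStart >= 0) {                  // hole ends (or vector ends)
                int bytes = (i - runStart) * UNIT;
                if (bytes >= MIN_CHUNK) free.add(new int[]{runStart * UNIT, bytes});
                runStart = -1;
            }
        }
        return free;
    }

    public static void main(String[] args) {
        // marked, 6-unit hole (48 bytes), marked, 2-unit hole (16 bytes: too small)
        boolean[] bits = {true, false, false, false, false, false, false, true, false, false};
        for (int[] chunk : sweep(bits))
            System.out.println("free @" + chunk[0] + " len " + chunk[1]);
    }
}
```

The point of the slide is visible in the code: heap memory is only inspected where the vector shows a big hole, so cost tracks the vector length rather than the number of dead objects.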
Parallel STW MSC on N-Way
Usually GC thread per processor
Parallel mark
Synchronized marking
Load balancing needed (overflow, starvation)
Separate mark stacks for each thread
Stealing from designated "private" areas, attached to mark stacks (Endo et al)
Direct access to mark stacks of other threads (Flood et al)
Adding a single shared stack (Cheng & Blelloch)
Parallel sweep
Heap divided into M areas, M > f * N
Synchronization needed for area selection, and free list concatenation
Compact
Tradeoff: parallelism vs. auxiliary data structures
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
The Concurrent Collection Principle
STW may get to seconds
Cost of mark phase dominant
Mark may be done while the program is active, except that the object graph changes...
Correctness kept by use of a write barrier
Activated at each change of a reference field in an object (gray it!)
Functionality is algorithm-dependent, sometimes expensive
The Concurrent Collection
What, Who, How
Mostly concurrent MSC
Tracing done while mutator threads are active (Boehm et al, Printezis & Detlefs)
Retrace (clean) may be done while mutator threads are active
Short final STW
Last clean and resulting tracing, and Sweep
Originally done by separate thread/processor
"Real" concurrent
Incremental work
Tracing done incrementally by the mutator threads
First done on a copying collector (Baker)
Parallel execution
Concurrent phase is also parallel
Many threads can do concurrent work simultaneously
The IBM Mostly Concurrent GC
Putting together all existing elements
First production-level parallel incremental mostly concurrent MSC collector
Combining incremental and concurrent tracing
Efficient concurrent marking that terminates on time
New mechanism for parallel load balancing
Especially fit for dynamic number of participating threads
When compared to mature industry-quality GC
Drastic reduction in pause time (more than 75%)
Small throughput hit (~10%)
Phases of The Collector
Concurrent phase
Tracing of all reachable objects
Done incrementally by Java mutators and dedicated low-priority tracing threads
Write barrier records changes per region (card) in a card table
Any change of a reference in an object dirties the card
All black objects in the card are changed to gray
Fast and relatively cheap operation (2% - 6% throughput hit)
A single card cleaning pass
In each dirty card, retrace all marked objects
Cleaning may precede the actual tracing
Final STW phase
Root scanning and final card cleaning pass
Tracing of all additional objects
Parallel sweep
Now replaced by concurrent sweep
CPU Distribution
Write Barrier
Activated by the JVM, on each reference change
done in Java
Writes in a card table
Each card covers 512 bytes of heap
Cleaning (concurrent or final) may happen anytime
Foo.a = O1
Store O1 in a root (guaranteed to be reachable)
Set Foo.a to O1
Activate Write Barrier on Foo
Dirty the entry of Foo in the card table
Remove O1 from root
An object may span many cards
Usually mapped to the card where its header starts
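A minimal model of the barrier sequence above. The 512-byte card size is from the slide; the `CardTable` class, the integer addresses, and the `Runnable` standing in for the actual store are hypothetical simplifications of the real JVM code.

```java
public class CardTable {
    static final int CARD_SIZE = 512;        // bytes of heap per card (as on the slide)
    final byte[] cards;                      // 0 = clean, 1 = dirty

    CardTable(int heapBytes) {
        cards = new byte[(heapBytes + CARD_SIZE - 1) / CARD_SIZE];
    }

    // Write barrier: perform the store, then dirty the card of the
    // *modified* object (the object whose reference field changed, i.e. Foo).
    void writeRef(int fooAddr, Runnable doStore) {
        doStore.run();                       // Set Foo.a to O1
        cards[fooAddr / CARD_SIZE] = 1;      // dirty the entry of Foo
    }

    boolean isDirty(int addr) {
        return cards[addr / CARD_SIZE] == 1;
    }

    public static void main(String[] args) {
        CardTable ct = new CardTable(4096);          // 8 cards
        int fooAddr = 1300;                          // Foo lives in card 2
        ct.writeRef(fooAddr, () -> { /* Foo.a = O1 */ });
        System.out.println(ct.isDirty(fooAddr) + " " + ct.isDirty(0));
    }
}
```

Card cleaning then only needs to revisit objects whose card byte is set, which is what keeps the barrier cheap (a store plus one byte write).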
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
The Problem of Punctual Termination
Traditional STW Collection starts when heap is full
Produces minimal number of GCs
Mostly concurrent aims at completing the concurrent marking when the heap becomes full
If heap gets filled before CM terminates, rest of marking is moved to the final phase
Longer pause
If CM terminates before the heap is filled, choose the lesser of two evils:
Wait for heap to be filled (and accumulate more dirty cards)
Initiate an "early" GC (more GCs, with all their additional costs)
Concurrent marking should be adaptive
Combining Concurrent and Incremental
Existing approaches
Incremental
Mutators do tracing proportional to allocation (Tracing Rate)
Tracing guaranteed to terminate (more or less) on time
Decreases application performance
Specialized GC threads (concurrent)
More efficient, better CPU utilization
Tracing rate determined by ratio between GC and program threads
Usually not changed by the collector
No control of termination point
Hybrid usage of both ways
Low priority background threads fully utilize CPU idle time
Not controlled by the tracing rate
Mutators perform incremental tracing, to ensure proper termination
Only if tracing goals not met by background threads
Control "Milestones" (concurrent start, card cleaning pass start, etc.)
Metering Formulas
Kickoff point of the concurrent phase
User-specified Tracing Rate (TR)
Live objects estimation (Lest), dirty objects estimation (Mest)
Start concurrent when free space gets below (Lest + Mest) / TR
So the tracing rate, applied to remaining allocations, matches the tracing work
Calculating the Work
Amount of concurrently traced objects (Traced)
Amount of remaining free memory (Free)
Estimated background threads tracing rate (Best)
Rate between total amounts of background tracing and allocations
Dynamically recalculate actual rate
ATR = ((Lest + Mest - Traced) / Free)
Rate between remaining work and free space
Account for work done in background: ATR2 = ATR - Best
Trace only if background tracing lags
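The formulas above can be checked with a small worked example. All the numbers below are invented; only the formulas and the symbols (Lest, Mest, TR, Traced, Free, Best) come from the slides.

```java
public class Metering {
    // Kick off concurrent marking when free space drops below this threshold.
    static double kickoffThreshold(double lest, double mest, double tr) {
        return (lest + mest) / tr;
    }

    // Actual tracing rate: remaining tracing work over remaining free space.
    static double atr(double lest, double mest, double traced, double free) {
        return (lest + mest - traced) / free;
    }

    // Discount background tracing; mutators trace only if ATR2 > 0.
    static double atr2(double atr, double best) {
        return atr - best;
    }

    public static void main(String[] args) {
        double lest = 600, mest = 40, tr = 8;            // invented figures (e.g. MB)
        System.out.println(kickoffThreshold(lest, mest, tr)); // 80.0: start CM here
        double a = atr(lest, mest, 200, 55);                  // (640-200)/55 = 8.0
        System.out.println(a);
        System.out.println(atr2(a, 3.0));                     // 5.0: background lags, mutators trace
    }
}
```

The recalculated ATR rises as free space shrinks relative to the remaining work, which is how the collector adapts to finish marking just as the heap fills.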
Behavior Patterns
[Chart: CPU usage over time, showing the mix of parallel STW, Java mutation, incremental tracing, and background tracing for each configuration]
STW MSC GC: throughput 100%
Con. Tr Rate 3: throughput 80%
Con. Tr Rate 8: throughput 90%
Con. Tr Rate 8, CPU 80%: throughput 95%
Con. Tr Rate 8, CPU 50%: throughput 110%
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Load Balancing - the Problem
Even distribution of objects between parallel tracing threads
Avoid mark stack overflow and/or starvation of threads
Suitable for unknown number of collection threads
Efficient synchronization
Supply simple termination detection
Existing approaches
All use separate mark stacks for each thread
Stealing from designated "private" areas, attached to mark stacks (Endo et al)
Direct access to mark stacks of other threads (Flood et al)
Adding a single shared stack (Cheng & Blelloch)
Load Balancing for Concurrent
Pools of WorkPackets
Each is a smaller mark stack
Cheap Get/Put synchronization (compare & swap)
Separate pools for different occupancy: Full, Non-Full, Non-Empty, Empty
Each pool maintains a counter
Tracing thread uses 2 WorkPackets
Objects are popped from the input WP
Newly marked objects are pushed to the output WP
Empty input WP is returned to the "Empty" pool
New "as-full-as-possible" WP is then fetched
Full output WP is returned to the "Full" pool
New "as-empty-as-possible" WP is then fetched
Different object graph traversal
BFS, limited by the capacity of a WorkPacket
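A single-threaded sketch of the WorkPacket scheme above. "Objects" are just integers forming a binary tree, the packet capacity is artificially small, and the Full/Non-Full/Non-Empty pools are merged into one non-empty pool for brevity; the real collector fetches and returns packets across the occupancy pools with compare-and-swap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class WorkPackets {
    static final int CAPACITY = 4;               // real packets are much larger

    static class Packet { final Deque<Integer> objs = new ArrayDeque<>(); }

    // "Object" i references 2i+1 and 2i+2 (a binary tree of n nodes, rooted at 0).
    static int trace(int n) {
        Deque<Packet> empty = new ArrayDeque<>();
        Deque<Packet> nonEmpty = new ArrayDeque<>();     // stands in for Full/Non-Empty pools
        for (int i = 0; i < 8; i++) empty.push(new Packet());

        Set<Integer> marked = new HashSet<>();
        Packet input = new Packet();                     // seeded with the root
        input.objs.push(0); marked.add(0);
        Packet output = empty.pop();

        while (true) {
            if (input.objs.isEmpty()) {
                empty.push(input);                       // return drained input WP
                if (nonEmpty.isEmpty()) {
                    if (output.objs.isEmpty()) break;    // termination: no work anywhere
                    input = output;                      // only the output has work: swap
                    output = empty.pop();
                } else {
                    input = nonEmpty.pop();              // fetch a fuller WP
                }
                continue;
            }
            int obj = input.objs.pop();
            for (int child : new int[]{2 * obj + 1, 2 * obj + 2}) {
                if (child < n && marked.add(child)) {
                    if (output.objs.size() == CAPACITY) {
                        nonEmpty.push(output);           // full output WP goes to the pool
                        output = empty.pop();
                    }
                    output.objs.push(child);
                }
            }
        }
        return marked.size();                            // objects traced
    }

    public static void main(String[] args) {
        System.out.println(WorkPackets.trace(100));      // all 100 tree nodes traced
    }
}
```

The pool states map directly onto the detection rules on the next slide: everything in the empty pool means termination, and only-empty-available while packets are still out means starvation.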
Advantages of WorkPackets
Fair competition when input is scarce
All threads get the same chance for tracing input
Simple detection of tracing state
Overflow - all packets are full
Scalability is possible: simply allocate more WPs
Starvation - only empty WPs available, but not all WPs in the "Empty" list
Termination - all WPs in the "Empty" list
Positive results measured
Low cost of synchronization
Fair distribution of work among threads
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Coding issues
Results (highlights)
Recent work
Concurrent Code Maintenance
Extremely difficult to verify:
Races between concurrent tracing and the program
Races between concurrent tracers
Timing is a major factor
Debug version cannot reproduce Release bugs
Problems surface only occasionally
Behavior is machine dependent
About 40% verification code
Sanity checks
Asserts, consistency checks
Logging of collection activity, state, and history
Shadow heap, for tracing history
Shadow card table, for card state and treatment
Code to use the above for printing detailed information.
Outline
Introduction
Principles of concurrent collector
Dividing the concurrent work
Parallel load balancing mechanism
Results (highlights)
Recent work
Comparison with STW GC
Compared to STW MSC
Using IBM's production-level JVM
4-way machines
NT, AIX and IA64
Mostly testing SPECjbb
Server-side Java
Throughput driven
60% live objects
Pause time cut by 75%
Mark time by 86%
Sweep becomes dominant
Throughput hit of 10%
Comparison with STW GC (cont.)
Also testing pBOB
IBM internal benchmark
Fit for 2.5 GB heap, with low CPU utilization
Many threads
Effects of Various Tracing Rates
Mutator utilization - amount of Java mutation done during the concurrent phase
Also controls the size of the "per thread mini-STW"
Effects of Various Tracing Rates
Floating garbage - marked objects that become unreachable before the final STW phase
Amount of cards cleaned
References
A parallel, incremental and concurrent GC for servers.
Ossia, Ben-Yitzhak, Goft, Kolodner, Leikehman, Owshanko. PLDI '02.
Mostly parallel garbage collection.
Boehm, Demers, Shenker. ACM SIGPLAN Notices, 1991.
On-the-fly Garbage Collection: An exercise in
cooperation.
Dijkstra, Lamport, Martin, Scholten, Steffens. Comm. ACM, 1978.
A generational mostly-concurrent garbage collector.
Printezis, Detlefs. ISMM 2000
And many more...
Outline
Dividing the concurrent work
Parallel load balancing mechanism
Results (highlights)
Recent work
Introduction
Concurrent Sweep
Sweep became the dominant part of the remaining pause time
Except for the needed initial allocation, the rest of sweep can be deferred
Concurrent sweep done incrementally
After the final phase, and before the next concurrent collection
Work done on each allocation request
No additional performance cost
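The allocation-driven sweep can be sketched like this: each allocation request first searches the free list, and sweeps one more heap area of the mark bit-vector only when no fitting chunk exists yet. The area granularity and the {area, offset, length} chunk representation are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class IncrementalSweep {
    final boolean[][] areas;                 // mark bits, one array per heap area
    int nextUnswept = 0;                     // areas below this index are already swept
    final Deque<int[]> freeList = new ArrayDeque<>();   // {area, offset, length} in units

    IncrementalSweep(boolean[][] markBits) { areas = markBits; }

    // Sweep one more area: coalesce runs of clear bits into free chunks.
    void sweepOneArea() {
        boolean[] bits = areas[nextUnswept];
        int run = -1;
        for (int i = 0; i <= bits.length; i++) {
            boolean marked = i < bits.length && bits[i];
            if (!marked && i < bits.length) {
                if (run < 0) run = i;                    // hole begins
            } else if (run >= 0) {                       // hole ends (or area ends)
                freeList.add(new int[]{nextUnswept, run, i - run});
                run = -1;
            }
        }
        nextUnswept++;
    }

    // First-fit allocation; sweeps lazily until a fitting chunk appears.
    int[] allocate(int sizeUnits) {
        while (true) {
            Iterator<int[]> it = freeList.iterator();
            while (it.hasNext()) {
                int[] c = it.next();
                if (c[2] >= sizeUnits) { it.remove(); return c; }
            }
            if (nextUnswept == areas.length) return null; // heap fully swept, no fit
            sweepOneArea();                               // sweep a bit more, then retry
        }
    }

    public static void main(String[] args) {
        IncrementalSweep heap = new IncrementalSweep(new boolean[][]{
            {true, true, true, true},        // area 0: fully live, yields nothing
            {true, false, false, false}      // area 1: a 3-unit hole
        });
        int[] chunk = heap.allocate(2);      // forces both areas to be swept
        System.out.println("area " + chunk[0] + " offset " + chunk[1] + " len " + chunk[2]);
    }
}
```

Since the scan work happens inside allocation requests that would otherwise pause anyway, the deferred sweep adds no separate pause, matching the "no additional performance cost" claim.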
Improving the Low Tracing Rate
Low tracing rate is more application friendly
More CPU left to the program
Shorter tracing periods forced on threads
But throughput is reduced
Goal: improve throughput with minimal hit on pause times
Achieved by reducing dirty cards and floating garbage
Better performance
Reduced heap residency
End