Microarchitectural Characterization of Production JVMs and Java Workloads
(work in progress)
Jungwoo Ha (UT Austin)
Magnus Gustafsson (Uppsala Univ.)
Stephen M. Blackburn (Australian Nat’l Univ.)
Kathryn S. McKinley (UT Austin)

Challenges of JVM Performance Analysis

- Controlling nondeterminism
  - Just-in-time compilation driven by nondeterministic sampling
  - Garbage collectors
  - Other helper threads
- Production JVMs are not created equal
  - Thread model (kernel vs. user threads)
  - Types of helper threads
- Need a solid measurement methodology that isolates each part of the JVM!

Forest and Trees

- Which performance metrics explain performance differences and bottlenecks?
  - Cache misses? L1 or L2?
  - TLB misses?
  - Number of instructions?
- Inspecting one or two metrics is not always enough
- Hardware performance counters expose only a small number of events at a time
  - Multiple invocations are therefore inevitable: for example, with 2 counters and 40 metrics, at least 20 measured iterations are needed

Case Study: jython

[Figures: per-iteration measurements of the jython benchmark across JVMs]
- Application performance (cycles)
- L1 instruction cache misses/cycle
- L1 data cache misses/cycle
- Total instructions executed (retired)
- L2 data cache misses/cycle

Project Status

- Established a methodology to characterize application code performance
  - Large number of metrics (40+) measured from hardware performance counters
  - Apples-to-apples comparison of JVMs using standard interfaces (JVMTI, JNI)
- Simulator data for detailed analysis
  - Limit studies, e.g., what if the L1 cache had no misses?
  - More performance metrics, e.g., uop mix

Performance Counter Methodology

- Collecting n metrics
  - x warmup iterations (x = 10)
  - p performance counters (can measure at most p metrics per iteration)
  - n/p iterations needed for measurement
  - k redundant measurements for statistical validation (k = 1)
- Need to hold the workload constant across multiple measurements

For each of y JVM invocations:
  1. Warm up the JVM (1st through xth iteration)
  2. Stop the JIT ((x+1)th iteration)
  3. Run a full-heap GC
  4. Measured runs ((x+2)th through (x+2+(n/p)k)th iteration), changing the metric set between iterations

A sketch of the measured-run schedule follows.
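
The following is a minimal sketch (in C) of that measured-run loop, assuming the PAPI 3 high-level counter API that the experiments use; the event list is illustrative, and run_iteration() is a hypothetical hook that runs one benchmark iteration to completion.

/* Measured-run schedule: rotate through n metrics, p at a time,
 * repeating the whole schedule k times for statistical validation. */
#include <stdio.h>
#include <papi.h>

#define N 6   /* total metrics wanted (the study measures 40+) */
#define P 2   /* hardware counters per run (2 on the Pentium-M) */
#define K 1   /* redundant measurements */

extern void run_iteration(void);   /* hypothetical benchmark hook */

static int events[N] = {
    PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_ICM,
    PAPI_L1_DCM,  PAPI_L2_DCM,  PAPI_TLB_IM,
};

int main(void)
{
    long_long values[P];

    /* ... x warmup iterations, stopping the JIT, and a full-heap GC
     * happen here, before any measured iteration ... */

    for (int rep = 0; rep < K; rep++) {
        for (int base = 0; base < N; base += P) {  /* n/p metric groups */
            if (PAPI_start_counters(&events[base], P) != PAPI_OK)
                return 1;
            run_iteration();               /* one measured iteration */
            if (PAPI_stop_counters(values, P) != PAPI_OK)
                return 1;
            for (int i = 0; i < P; i++)
                printf("event %d = %lld\n", events[base + i], values[i]);
        }
    }
    return 0;
}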

Performance Counter Methodology

- Stop-the-world garbage collector
  - No concurrent marking
- One perfctr instance per pthread
  - JVM internal threads run in different pthreads from the application
- JVMTI callbacks (sketched below)
  - Thread start: start counter
  - Thread finish: stop counter
  - GC start: pause counter (only for user-level threads)
  - GC stop: resume counter (only for user-level threads)
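
A minimal JVMTI agent wiring up these four callbacks might look like the sketch below. The counter_* functions are hypothetical wrappers around the per-thread perfctr/PAPI state; the agent plumbing itself is the standard JVMTI API.

#include <jvmti.h>

extern void counter_start(void);    /* hypothetical perfctr wrappers */
extern void counter_stop(void);
extern void counter_pause(void);
extern void counter_resume(void);

static void JNICALL on_thread_start(jvmtiEnv *jvmti, JNIEnv *jni, jthread t)
{
    counter_start();                /* thread start -> start counter */
}

static void JNICALL on_thread_end(jvmtiEnv *jvmti, JNIEnv *jni, jthread t)
{
    counter_stop();                 /* thread finish -> stop counter */
}

static void JNICALL on_gc_start(jvmtiEnv *jvmti)
{
    counter_pause();                /* GC start -> pause (user-level threads) */
}

static void JNICALL on_gc_finish(jvmtiEnv *jvmti)
{
    counter_resume();               /* GC stop -> resume (user-level threads) */
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved)
{
    jvmtiEnv *jvmti;
    if ((*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_0) != JNI_OK)
        return JNI_ERR;

    /* GC events require an explicitly requested capability. */
    jvmtiCapabilities caps = {0};
    caps.can_generate_garbage_collection_events = 1;
    (*jvmti)->AddCapabilities(jvmti, &caps);

    jvmtiEventCallbacks cb = {0};
    cb.ThreadStart = on_thread_start;
    cb.ThreadEnd = on_thread_end;
    cb.GarbageCollectionStart = on_gc_start;
    cb.GarbageCollectionFinish = on_gc_finish;
    (*jvmti)->SetEventCallbacks(jvmti, &cb, sizeof(cb));

    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_THREAD_START, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_THREAD_END, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_GARBAGE_COLLECTION_START, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_GARBAGE_COLLECTION_FINISH, NULL);
    return JNI_OK;
}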

Methodology Limitations

- Cannot factor out memory barrier overhead
  - Mitigation: use the garbage collector with the least application overhead
- If a helper thread runs in the same pthread as the application (a user-level thread), it will cause perturbation
  - No evidence of this in J9, HotSpot, or JRockit
- Instrumented code overhead
  - Must be included in the measurement

Experiment

- Performance counter experiment
  - Pentium-M uniprocessor
    - 32KB 8-way L1 caches (data & instruction)
    - 2MB 4-way L2 cache
    - 2 hardware counters (18 if multiplexed; see the sketch after this list)
  - 1GB memory
  - 32-bit Linux 2.6.20 with the perfctr patch
  - PAPI 3.5.0 library

- Simulator experiment
  - PTLsim (http://www.ptlsim.org) x86 simulator
  - 64-bit AMD Athlon
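
On the "18 if multiplexed" point: below is a minimal sketch of how multiplexing is enabled with PAPI 3.x's low-level API, using four illustrative preset events. Multiplexed events time-share the two physical counters, so the reported totals are scaled statistical estimates rather than exact counts, which is the usual tradeoff against measuring only p events exactly per run.

/* Multiplexing more events onto the Pentium-M's two hardware counters. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long_long values[4];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_multiplex_init() != PAPI_OK)
        return 1;

    PAPI_create_eventset(&es);
    PAPI_set_multiplex(es);     /* time-share the 2 physical counters */

    /* Four events on two counters: each is sampled part of the time
     * and scaled up, so the totals are estimates. */
    PAPI_add_event(es, PAPI_TOT_CYC);
    PAPI_add_event(es, PAPI_TOT_INS);
    PAPI_add_event(es, PAPI_L1_DCM);
    PAPI_add_event(es, PAPI_L2_DCM);

    PAPI_start(es);
    /* ... run one benchmark iteration here ... */
    PAPI_stop(es, values);

    for (int i = 0; i < 4; i++)
        printf("%lld\n", values[i]);
    return 0;
}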

Experiment

- 3 production JVMs × 2 versions
  - IBM J9, Sun HotSpot, BEA JRockit (perfctr only)
  - Versions 1.5 and 1.6
  - Heap size = max(16MB, 4 × minimum heap size)
- 18 benchmarks
  - 9 DaCapo benchmarks
  - 8 SPECjvm98
  - 1 pseudojbb

Experiment

- 40+ metrics
  - 40 distinct metrics from the performance counters
    - L1 and L2 cache misses (instruction, data, read, write)
    - TLB-I misses
    - Branch predictions
    - Resource stalls
  - Richer metrics from the simulator
    - Micro-operation (uop) mix
    - Load-to-store ratio

Performance Counter Results (Cycle Counts)

[Figures: cycle counts per iteration for pseudojbb, jython, pmd, jess, jack, compress, hsqldb, and db]

Performance Counter Results

- IBM J9 1.6 performed better than Sun HotSpot 1.6 on average
- JRockit has the most variation in performance
- Full results
  - ~800 graphs; full jython results in the paper
  - http://z.cs.utexas.edu/users/habals/jvmcmp (or Google my name, Jungwoo Ha)

Future Work

- JVM activity characterization
  - Garbage collector
  - JIT
- Statistical analysis of performance metrics
  - Metric correlation
  - Methodology to identify performance bottlenecks
- Multicore performance analysis

Conclusions

- Methodology for production JVM comparison
- Performance evaluation data
- Simulator results for deeper analysis

Thank you!

Simulation Result

[Figures: perfect-cache limit study, showing cycles with an ideal L1 cache for compress and db]