CS380 C
lecture 20
• Last time
– Linear scan register allocation
– Classic compilation techniques
– On to a modern context
• Today
– Jenn Sartor
– Experimental evaluation for managed
languages with JIT compilation and
garbage collection
1
Wake Up and Smell the Coffee:
Performance Analysis Methodologies
for the 21st Century
Kathryn S McKinley
Department of Computer Sciences
University of Texas at Austin
2
Shocking News!
In 2000, Java overtook C and C++ as the
most popular programming language
[TIOBE 2000--2008]
3
Systems Research
in Industry and Academia
ISCA 2006
20 papers use C and/or C++
5 papers are orthogonal to the programming language
2 papers use specialized programming languages
2 papers use Java and C from SPEC
1 paper uses only Java from SPEC
4
What is Experimental
Computer Science?
5
What is Experimental
Computer Science?
• An idea
• An implementation in some system
• An evaluation
6
The success of most systems innovation
hinges on evaluation methodologies.
1. Benchmarks reflect current and, ideally,
future reality
2. Experimental design is appropriate
3. Statistical data analysis
7
The success of most systems innovation
hinges on experimental methodologies.
1. Benchmarks reflect current and, ideally,
future reality
[DaCapo Benchmarks 2006]
2. Experimental design is appropriate.
3. Statistical data analysis [Georges et al. 2006] (sketched below)
8
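To make the statistical-analysis point above concrete, here is a minimal Java sketch (mine, not from the talk or from Georges et al.'s tooling) that reports a mean with a 95% confidence interval rather than a single best number; it uses the normal approximation, and the run times in main() are invented for illustration.

    // ConfidenceInterval.java -- illustrative sketch only; the run times in main() are made up.
    public class ConfidenceInterval {

        // Returns {mean, lower, upper} for a 95% confidence interval using the normal
        // approximation (z = 1.96); for small samples a Student-t critical value is better.
        static double[] mean95(double[] times) {
            int n = times.length;
            double sum = 0;
            for (double t : times) sum += t;
            double mean = sum / n;
            double ss = 0;
            for (double t : times) ss += (t - mean) * (t - mean);
            double stderr = Math.sqrt(ss / (n - 1)) / Math.sqrt(n);
            double half = 1.96 * stderr;
            return new double[] { mean, mean - half, mean + half };
        }

        public static void main(String[] args) {
            double[] runs = { 812.4, 805.1, 820.9, 808.3, 815.6 };  // ms, fabricated for illustration
            double[] ci = mean95(runs);
            System.out.printf("mean %.1f ms, 95%% CI [%.1f, %.1f]%n", ci[0], ci[1], ci[2]);
        }
    }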
Experimental Design
• We’re not in Kansas anymore!
– JIT compilation, GC, dynamic checks, etc.
• Methodology has not adapted
– Needs to be updated and institutionalized
“…this sophistication provides a significant challenge to
understanding complete system performance, not found in
traditional languages such as C or C++” [Hauswirth et al. OOPSLA ’04]
9
Experimental Design
• Comprehensive comparison
– 3 state-of-the-art JVMs
– Best of 5 executions (sketched below)
– 19 benchmarks
– Platform: 2GHz Pentium-M, 1GB RAM, Linux 2.6.15
10
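The “best of 5 executions” design above can be scripted roughly as follows; this sketch is mine, not the harness used for the comparison, and the jar name and benchmark on the command line are placeholders. Each measurement launches a fresh JVM so one run's JIT and GC state cannot leak into the next.

    // BestOfFive.java -- sketch; the command line below is a placeholder, not the talk's setup.
    import java.io.IOException;

    public class BestOfFive {
        public static void main(String[] args) throws IOException, InterruptedException {
            String[] cmd = { "java", "-jar", "dacapo.jar", "fop" };  // hypothetical jar/benchmark names
            long bestMs = Long.MAX_VALUE;
            for (int run = 1; run <= 5; run++) {
                long start = System.nanoTime();
                Process p = new ProcessBuilder(cmd).inheritIO().start();  // fresh JVM per measurement
                if (p.waitFor() != 0) throw new RuntimeException("benchmark run failed");
                long ms = (System.nanoTime() - start) / 1_000_000;
                bestMs = Math.min(bestMs, ms);
                System.out.println("run " + run + ": " + ms + " ms");
            }
            System.out.println("best of 5: " + bestMs + " ms");
        }
    }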
Experimental Design
[Figure: Relative Performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 across the DaCapo benchmarks and their geometric mean]
11
Experimental Design
[Figure: Relative Performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 across the DaCapo benchmarks, shown as multiple panels]
12
Experimental Design
[Figure: Relative Performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 across the DaCapo benchmarks, shown as multiple panels]
13
Experimental Design
[Figure: three panels — First Iteration, Second Iteration, Third Iteration — showing Relative Performance of Sun JDK 1.6, IBM J9, and BEA JRockit 1.6 across the DaCapo benchmarks]
14
Experimental Design
Another Experiment
• Compare two garbage collectors
– Semispace Full Heap Garbage Collector
– Marksweep Full Heap Garbage Collector
15
Experimental Design
Another Experiment
• Compare two garbage collectors
– Semispace Full Heap Garbage Collector
– Marksweep Full Heap Garbage Collector
• Experimental Design
– Same JVM, same compiler settings
– Second iteration for both
– One benchmark: SPEC _209_db
– Best of 5 executions
– Platform: 2GHz Pentium-M, 1GB RAM, Linux 2.6.15
16
Marksweep vs Semispace
[Figure: SPEC _209_db Performance — Normalized Time, Marksweep vs. Semispace]
17
Marksweep vs Semispace
[Figure: SPEC _209_db Performance — Normalized Time, Marksweep vs. Semispace]
18
Marksweep vs Semispace
[Figure: SPEC _209_db Performance — Normalized Time vs. Heap Size (20–120 MB), Marksweep vs. Semispace]
19
Experimental Design
20
Experimental Design:
Best Practices
• Measuring JVM innovations
• Measuring JIT innovations
• Measuring GC innovations
• Measuring Architecture innovations
21
JVM Innovation
Best Practices
• Examples:
– Thread scheduling
– Performance monitoring
• Workload triggers differences
– real workloads & perhaps microbenchmarks
– e.g., force frequency of thread switching
• Measure & report multiple iterations
– start up
– steady state (aka server mode)
– never configure the VM to use completely unoptimized code!
• Use a modest heap size, or multiple heap sizes, computed as a
function of the maximum live size of the application
• Use & report multiple architectures
22
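As a concrete companion to the “measure & report multiple iterations” advice above, here is a minimal in-process Java sketch (a toy workload stands in for a real benchmark) that times each iteration separately, so the first iteration (start-up, dominated by class loading and JIT work) can be reported apart from later, warmer iterations.

    // IterationTimer.java -- sketch; the workload below is a toy stand-in for a real benchmark.
    public class IterationTimer {

        // Times each iteration separately so the first (start-up, includes class loading
        // and JIT work) can be reported apart from later, mostly steady-state iterations.
        static void measure(Runnable iteration, int iterations) {
            for (int i = 1; i <= iterations; i++) {
                long start = System.nanoTime();
                iteration.run();
                double ms = (System.nanoTime() - start) / 1e6;
                System.out.printf("%s iteration %d: %.1f ms%n", (i == 1) ? "start-up" : "warm", i, ms);
            }
        }

        public static void main(String[] args) {
            measure(() -> {
                long acc = 0;
                for (int i = 0; i < 20_000_000; i++) acc ^= i * 31L;
                if (acc == 42) System.out.println(acc);   // keep the loop from being optimized away
            }, 10);
        }
    }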
Best Practices
[Figure: performance relative to best for the 1st, 2nd, and 3rd iterations of JVM A and JVM B on three platforms — Pentium M, AMD Athlon, and SPARC — across the DaCapo benchmarks (antlr, bloat, chart, eclipse, fop, hsqldb, jython, lusearch, luindex, pmd, xalan) with min, max, and geomean]
23
JIT Innovation
Best Practices
Example: new compiler optimization
– Code quality: Does it improve the application code?
– Compile time: How much compile time does it add?
– Total time: compiler and application time together
– Problem: adaptive compilation responds to
compilation load
– Question: How do we tease all these effects apart?
24
JIT Innovation
Best Practices
Teasing apart compile time and code quality
requires multiple experiments
• Total time: Mix methodology
– Run adaptive system as intended
• Result: mixture of optimized and unoptimized code
– First & second iterations (that include compile time)
– Set and/or report the heap size as a function of maximum live
size of the application
– Report: average and show statistical error
• Code quality
– OK: Run iterations until performance stabilizes on “best”, or
– Better: Run several iterations of the benchmark, turn off the
compiler, and measure a run guaranteed to have no compilation
– Best: Replay mix compilation
• Compile time
– Requires the compiler to be deterministic
– Replay mix compilation
25
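The “run iterations until performance stabilizes” option above can be mechanized with a stopping rule; the sketch below uses a coefficient-of-variation threshold over a sliding window of recent iteration times, which is one plausible criterion rather than the exact one used in any particular methodology.

    // SteadyState.java -- sketch of one plausible stopping rule, not a prescribed criterion.
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class SteadyState {

        static double mean(Deque<Double> xs) {
            double s = 0;
            for (double x : xs) s += x;
            return s / xs.size();
        }

        // Coefficient of variation (stddev / mean) over the window.
        static double cov(Deque<Double> xs) {
            double m = mean(xs), ss = 0;
            for (double x : xs) ss += (x - m) * (x - m);
            return Math.sqrt(ss / (xs.size() - 1)) / m;
        }

        // Iterate until the last k iteration times vary by less than 'threshold',
        // or until maxIters; return the mean of the last k times (in ms).
        static double steadyState(Runnable iteration, int k, double threshold, int maxIters) {
            Deque<Double> window = new ArrayDeque<>();
            for (int i = 0; i < maxIters; i++) {
                long start = System.nanoTime();
                iteration.run();
                window.addLast((System.nanoTime() - start) / 1e6);
                if (window.size() > k) window.removeFirst();
                if (window.size() == k && cov(window) < threshold) return mean(window);
            }
            return mean(window);   // did not fully stabilize within maxIters
        }

        public static void main(String[] args) {
            double ms = steadyState(() -> {
                long acc = 0;
                for (int i = 0; i < 10_000_000; i++) acc += i * 17L;
                if (acc == 1) System.out.println(acc);   // keep the toy loop live
            }, 5, 0.02, 100);
            System.out.printf("steady-state time: %.2f ms%n", ms);
        }
    }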
Replay Compilation
Force the JIT to produce a deterministic result
• Make a compilation profiler & replayer
Profiler
– Profile first or later iterations with adaptive JIT, pick best or
average
– Record profiling information used in compilation decisions, e.g.,
dynamic profiles of edges, paths, &/or dynamic call graph
– Record compilation decisions, e.g., compile method bar at level
two, inline method foo into bar
– Mix of optimized and unoptimized, or all optimized/unoptimized
Replayer
– Reads in profile
– As the system loads each class, apply profile +/- innovation
• Result
– controlled experiments with deterministic compiler behavior
– reduces statistical variance in measurements
• Still not a perfect methodology for inlining
26
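Replay compilation lives inside the VM, so it cannot be shown as a standalone program; the schematic Java sketch below only illustrates the kind of information a compilation profile might record on the profiling side and expose to the replayer. The class, fields, and methods are hypothetical, not Jikes RVM's (or any VM's) actual interfaces.

    // ReplayProfile.java -- schematic only; all names and structure are hypothetical.
    import java.io.*;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReplayProfile implements Serializable {
        // Method signature -> optimization level the adaptive system chose.
        private final Map<String, Integer> optLevel = new HashMap<>();
        // Caller signature -> callees that were inlined into it.
        private final Map<String, List<String>> inlinePlan = new HashMap<>();
        // Edge/branch id -> observed frequency, feeding profile-guided decisions.
        private final Map<String, Double> edgeProfile = new HashMap<>();

        // Profiler side: record what the adaptive JIT decided during a profiling run.
        void record(String method, int level, List<String> inlinees) {
            optLevel.put(method, level);
            inlinePlan.put(method, inlinees);
        }

        void save(File f) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
                out.writeObject(this);
            }
        }

        static ReplayProfile load(File f) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
                return (ReplayProfile) in.readObject();
            }
        }

        // Replayer side: as each class is loaded, the VM would compile each method at the
        // recorded level with the recorded inlining plan instead of re-deciding adaptively.
        int levelFor(String method) { return optLevel.getOrDefault(method, 0); }
        List<String> inlineesFor(String method) { return inlinePlan.getOrDefault(method, List.of()); }
    }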
GC Innovation
Best Practices
• Requires more than one experiment...
• Use & report a range of fixed heap sizes
– Explore the space time tradeoff
– Measure heap size with respect to the maximum live size of the
application
– VMs should report total memory not just application memory
• Different GC algorithms vary in the meta-data they require
• JIT and VM use memory...
• Measure time with a constant workload
– Do not measure throughput
• Best: run two experiments
– mix with adaptive methodology: what users are likely to see in practice
– replay: hold the compiler activity constant
• Choose a profile with “best” application performance in order to keep from
hiding mutator overheads in bad code.
27
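A small sketch of the “range of fixed heap sizes measured with respect to the maximum live size” advice above: given a benchmark's maximum live size, generate a sweep of fixed heap sizes expressed as multiples of it and pass each to the VM. The 1.5x–6x range and the -Xms/-Xmx flags are illustrative choices, not prescribed by the talk.

    // HeapSizes.java -- sketch; the 1.5x-6x sweep and the -Xms/-Xmx flags are illustrative choices.
    public class HeapSizes {

        // Heap sizes (MB) from minMultiple to maxMultiple times the max live size, evenly spaced.
        static int[] heapSizes(int maxLiveMB, double minMultiple, double maxMultiple, int steps) {
            int[] sizes = new int[steps];
            for (int i = 0; i < steps; i++) {
                double m = minMultiple + i * (maxMultiple - minMultiple) / (steps - 1);
                sizes[i] = (int) Math.ceil(maxLiveMB * m);
            }
            return sizes;
        }

        public static void main(String[] args) {
            // e.g., a benchmark whose maximum live size is 20 MB, swept from 1.5x to 6x.
            for (int mb : heapSizes(20, 1.5, 6.0, 10)) {
                System.out.println("-Xms" + mb + "m -Xmx" + mb + "m");   // fixed heap: min = max
            }
        }
    }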
Architecture Innovation
Best Practices
• Requires more than one experiment...
• Use more than one VM
• Set a modest heap size and/or report heap size as a function of
maximum live size
• Use a mixture of optimized and uncompiled code
• Simulator needs the “same” code in many cases to perform
comparisons
• Best for microarchitecture only changes:
– Multiple traces from live system with adaptive methodology
• start up and steady state with compiler turned off
• what users are likely to see in practice
• Won't work if the architecture change requires recompilation, e.g., a new
sampling mechanism
– Use replay to make the code as similar as possible
28
There are lies, damn lies, statistics and benchmarks
— Disraeli
“…improves throughput by up to 41x”
“speed up by 10-25% in many cases…”
“…about 2x in two cases…”
“…more than 10x in two small benchmarks”
“speedups of 1.2x to 6.4x on a variety of benchmarks”
“can reduce garbage collection time by 50% to 75%”
“…demonstrating high efficiency and scalability”
“our prototype has usable performance”
“sometimes more than twice as fast”
“our algorithm is highly efficient”
“garbage collection degrades performance by 70%”
“speedups…. are very significant (up to 54-fold)”
“our …. is better or almost as good as …. across the board”
“the overhead …. is on average negligible”
Quotes from recent research papers
29
Conclusions
• Methodology includes
– Benchmarks
– Experimental design
– Statistical analysis [OOPSLA 2007]
• Poor Methodology
– can focus or misdirect innovation and energy
• We have a unique opportunity
– Transactional memory, multicore performance, dynamic languages
• What we can do
– Enlist VM builders to include replay
– Fund and broaden participation in benchmarking
• Research and industrial partnerships
• Funding through NSF, ACM, SPEC, industry or ??
– Participate in building community workloads
30
CS380 C
• More on Java Benchmarking
– www.dacapobench.org
– Alias analysis
• Read: A. Diwan, K. S. McKinley, and J. E. B. Moss,
Using Types to Analyze and Optimize Object-Oriented
Programs, ACM Transactions on Programming
Languages and Systems, 23(1): 30-72, January 2001.
31
Suggested Readings
Performance Evaluation of JVMs
• How Java Programs Interact with Virtual Machines at the Microarchitectural Level,
Lieven Eeckhout, Andy Georges, and Koen De Bosschere, The 18th Annual ACM SIGPLAN
Conference on Object-Oriented Programming, Systems, Languages and Applications
(OOPSLA'03), Oct. 2003.
• Method-Level Phase Behavior in Java Workloads, Andy Georges, Dries Buytaert,
Lieven Eeckhout, and Koen De Bosschere, The 19th Annual ACM SIGPLAN Conference on
Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'04), Oct. 2004.
• Myths and Realities: The Performance Impact of Garbage Collection, S. M. Blackburn,
P. Cheng, and K. S. McKinley, ACM SIGMETRICS Conference on Measurement & Modeling of
Computer Systems, pp. 25--36, New York, NY, June 2004.
• The DaCapo Benchmarks: Java Benchmarking Development and Analysis, S. M. Blackburn
et al., The ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages
and Applications (OOPSLA), Portland, OR, pp. 191--208, October 2006.
• Statistically Rigorous Java Performance Evaluation, A. Georges, D. Buytaert, and
L. Eeckhout, The ACM SIGPLAN Conference on Object-Oriented Programming, Systems,
Languages and Applications (OOPSLA), Montreal, Canada, Oct. 2007.
32