www.lst.inf.ethz.ch
Performance Optimizations
for NUMA-Multicore Systems
Zoltán Majó
Department of Computer Science
ETH Zurich, Switzerland
About me
ETH Zurich: research assistant
Research: performance optimizations
Assistant: lectures
TUCN (Technical University of Cluj-Napoca)
Student
Communications Center: network engineer
Department of Computer Science: assistant
2
Computing
Unlimited need for performance
3
Performance optimizations
One goal: make programs run fast
Idea: pick good algorithm
Reduce number of operations executed
Example: sorting
4
Sorting
[Chart, slides 5-8: number of operations / execution time (in units of T) vs. input size n = 1..12 for an n^2 algorithm and an n*log(n) algorithm (legend: Polynomial (n^2), n*log(n)); the final build adds the callout: 11X faster]
8
Sorting
We picked a good algorithm, so the work seems done
Are we really done?
We must also make sure the algorithm runs fast on the hardware
Operations take time
So far we assumed 1 operation = 1 time unit T
9
Quicksort performance
[Chart, slides 10-14: measured execution time (in units of T) of quicksort vs. input size n = 1..12; the final build adds the callout: 32% faster]
14
Latency of operations
Best algorithm not enough
Operations are executed on hardware
[Figure: CPU pipeline with Stage 1: dispatch operation, Stage 2: execute operation, Stage 3: retire operation]
15
Latency of operations
Hardware must be used efficiently
16
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusion
ETH scholarship
17
Memory accesses
CPU
230 cycles access latency
RAM
18
Memory accesses
CPU
Total access latency = 16 x 230 cycles = 3680 cycles
230 cycles access latency
RAM
19
Caching
CPU
230 cycles access latency
RAM
20
Caching
[Figure, slides 21-25: a cache (30 cycles access latency) is placed between the CPU and RAM (200 cycles access latency); data moves between RAM and the cache one block at a time, filling the cache as the accesses proceed]
25
Hits and misses
CPU
Cache miss: data not in cache = 230 cycles
Cache hit: data in cache = 30 cycles
30 cycles access latency
Cache
200 cycles access latency
RAM
26
Total access latency
CPU
Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles (80 cycles per access on average)
30 cycles access latency
Cache
200 cycles access latency
RAM
27
Benefits of caching
Comparison
Architecture w/o cache: T = 230 cycles
Architecture w/ cache: Tavg = 80 cycles → 2.7X improvement
Do caches always help?
Can you think of access pattern with bad cache usage?
28
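Not on the original slides; a minimal C sketch of one possible answer to the question above. It assumes a 64-byte cache block and an array much larger than the cache: striding by exactly one block makes every access a miss, while a sequential walk over the same data hits the cache on most accesses.

#include <stddef.h>

#define BLOCK 64                        /* assumed cache block size in bytes */
#define N (64 * 1024 * 1024)            /* array much larger than the cache */

long sum_strided(const char *a) {
    long sum = 0;
    for (size_t i = 0; i < N; i += BLOCK)   /* one access per block: every access misses */
        sum += a[i];
    return sum;
}

long sum_sequential(const char *a) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)          /* BLOCK accesses per block: mostly hits */
        sum += a[i];
    return sum;
}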
Caching
CPU
Block size:
35 cycles access latency
Cache
200 cycles access latency
RAM
29
Cache-aware programming
Today’s example: matrix-matrix multiplication (MMM)
Number of operations: n^3
Compare naïve and optimized implementation
Same number of operations
30
MMM: naïve implementation
[Figure: C = A x B; element C[i][j] is the dot product of row i of A and column j of B]

for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        sum = 0.0;
        for (k=0; k<N; k++)
            sum += A[i][k]*B[k][j];
        C[i][j] = sum;
    }
31
MMM
[Figure, slides 32-36: CPU with a cache (30 cycles access latency) in front of RAM (200 cycles access latency); matrices A, B, C reside in RAM; counting cache hits over 4 accesses each: A[][]: 3 hits, B[][]: 0 hits]
36
MMM: Cache performance
Hit rate
Accesses to A[][]: 3/4 = 75%
Accesses to B[][]: 0/4 = 0%
All accesses: 3/8 ≈ 38%
Can we do better?
37
Cache-friendly MMM
Cache-unfriendly MMM (ijk):

for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        sum = 0.0;
        for (k=0; k<N; k++)
            sum += A[i][k]*B[k][j];
        C[i][j] += sum;
    }

Cache-friendly MMM (ikj):

for (i=0; i<N; i++)
    for (k=0; k<N; k++) {
        r = A[i][k];
        for (j=0; j<N; j++)
            C[i][j] += r*B[k][j];
    }

[Figure: in the ikj order, row i of C is updated with row k of B scaled by A[i][k], so both C and B are traversed row by row]
38
MMM
[Figure: with the ikj loop order, counting cache hits over 4 accesses each: C[][]: 3 hits, B[][]: 3 hits]
39
Cache-friendly MMM
Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate, B[][]: 0/4 = 0% hit rate, all accesses: 38% hit rate
Cache-friendly MMM (ikj): C[][]: 3/4 = 75% hit rate, B[][]: 3/4 = 75% hit rate, all accesses: 75% hit rate
Better performance due to cache-friendliness?
40
Performance of MMM
[Chart, slides 41-42, log scale: execution time [s] vs. matrix size (512-8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ikj is about 20X faster]
42
Cache-aware programming
Two versions of MMM: ijk and ikj
Same number of operations (~n^3)
ikj is 20X faster than ijk
Good performance depends on two aspects
Good algorithm
Implementation that takes hardware into account
Hardware
Many possibilities for inefficiencies
We consider only the memory system in this lecture
43
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
44
Cache-based architecture
[Figure: CPU → L1 cache (10 cycles access latency) → L2 cache (20 cycles access latency) → bus controller → memory controller → RAM (200 cycles access latency)]
45
Multi-core multiprocessor
[Figure: two processor packages, each with four cores; every core has a private L1 cache, pairs of cores share an L2 cache, each package has a bus controller, and both packages reach RAM through a single shared memory controller]
46
Experiment
Performance of a well-optimized program: soplex from SPEC CPU 2006
Multicore-multiprocessor systems are parallel
Multiple programs run on the system simultaneously
Contender program: milc from SPEC CPU 2006
Examine 4 execution scenarios
47
Execution scenarios
[Figure, slides 48-49: soplex and milc placed on the two quad-core processors to produce the four scenarios examined below: running solo; sharing the L2 cache, bus controller, and memory controller; sharing the bus controller and memory controller; sharing only the memory controller]
49
Performance with sharing: soplex
[Chart, slides 50-52: execution time of soplex relative to solo execution in the four scenarios: solo; shared cache, bus controller, and memory controller; shared bus controller and memory controller; shared memory controller]
52
Resource sharing
Significant slowdowns due to resource sharing
Why is resource sharing so bad?
Example: cache sharing
53
Cache sharing
[Figure, slides 54-55: soplex and milc run on two cores that share a cache; their data competes for the shared cache capacity]
55
Resource sharing
Does resource sharing affect all programs?
So far we considered the performance of soplex under contention
Let us consider a different program: namd
56
Performance with sharing
[Chart, slides 57-58: execution time relative to solo execution for soplex and namd in the four sharing scenarios; soplex slows down significantly, namd much less]
58
Resource sharing
Significant slowdown for some programs
soplex: affected significantly
namd: affected less
What do we do about it?
Scheduling can help
Example workload: four instances of soplex and four instances of namd
59
Execution scenarios
[Figure, slides 60-61: two alternative placements of the soplex and namd instances on the cores; the schedules differ in which programs end up sharing an L2 cache]
61
Challenges for a scheduler
Programs have different behaviors
Behavior not known ahead-of-time
soplex vs. namd
Behavior changes over time
62
Single-phased program
63
Program with multiple phases
64
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
65
Hardware performance counters
Special registers
Programmable to monitor given hardware event (e.g., cache misses)
Low-level information about hardware-software interaction
Low overhead due to hardware implementation
In the past: undocumented feature
Since Intel Pentium: publicly available description
Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
66
Programming performance counters
Model-specific registers
Access: RDMSR, WRMSR, and RDPMC instructions
Ring 0 instructions (available only in kernel-mode)
perf_events interface
Standard Linux interface since Linux 2.6.31
UNIX philosophy: performance counters are files
Simple API:
Set up counters: perf_event_open()
Read counters as files
67
Example: monitoring cache misses
int main() {
    int pid = fork();
    if (pid == 0) {
        /* child: run the program to be measured */
        exit(execl("./my_program", "./my_program", (char *)NULL));
    } else {
        /* parent: set up the counter, wait for the child, read the count */
        int status; uint64_t value;
        int fd = perf_event_open(...);          /* arguments elided on the slide */
        waitpid(pid, &status, 0);
        read(fd, &value, sizeof(uint64_t));
        printf("Cache misses: %" PRIu64 "\n", value);
    }
}
68
perf_event_open()
Looks simple
int sys_perf_event_open(
    struct perf_event_attr *hw_event_uptr,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
);

...but struct perf_event_attr has many fields:

struct perf_event_attr {
    __u32 type;
    __u32 size;
    __u64 config;
    union {
        __u64 sample_period;
        __u64 sample_freq;
    };
    __u64 sample_type;
    __u64 read_format;
    __u64 inherit;
    __u64 pinned;
    __u64 exclusive;
    __u64 exclude_user;
    __u64 exclude_kernel;
    __u64 exclude_hv;
    __u64 exclude_idle;
    __u64 mmap;
    /* ... more fields ... */
};
69
libpfm
Open-source helper library
[Figure: the user program gives (1) an event name to libpfm, which (2) fills in a perf_event_attr; the program then (3) calls perf_event_open() and (4) reads the results from perf_events]
70
Example: measure cache misses for MMM
Determine microarchitecture
Intel Xeon E5520: Nehalem microarchitecture
Look up event needed
Source: Intel Architectures Software Developer's Manual
71
Software Developer’s Manual
72
Example: measure cache misses for MMM
Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
73
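Not the lecture's code; a hedged sketch of how libpfm (steps 1-3 from the earlier figure) can turn this event name into a programmed counter. It assumes libpfm4 with its perf_event helper header; error handling is minimal.

#include <perfmon/pfmlib_perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

int open_offcore_counter(pid_t pid) {
    struct perf_event_attr attr;
    pfm_perf_encode_arg_t arg;

    if (pfm_initialize() != PFM_SUCCESS)
        return -1;

    memset(&attr, 0, sizeof(attr));
    memset(&arg, 0, sizeof(arg));
    arg.attr = &attr;
    arg.size = sizeof(arg);

    /* (1)+(2): libpfm translates the event name into attr.type/attr.config */
    if (pfm_get_os_event_encoding("OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM",
                                  PFM_PLM3, PFM_OS_PERF_EVENT, &arg) != PFM_SUCCESS)
        return -1;

    /* (3): perf_event_open has no glibc wrapper, so it is invoked via syscall() */
    return syscall(__NR_perf_event_open, &attr, pid,
                   -1 /* any CPU */, -1 /* no group */, 0);
}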
MMM cache misses
[Chart, log scale: number of cache misses (millions) vs. matrix size (512-8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ikj incurs about 30X fewer misses]
74
Single-phased program
set up performance counters
read performance counters
75
Program with multiple phases
set up performance counters
get sample
76
Membus: multicore scheduler
1. Dynamically determine program behavior
Measure # of loads/stores that cause memory traffic
Hardware performance counters in sampling mode
2. Determine optimal placement based on measurements
77
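A rough sketch (an assumption, not Membus source code) of what "sampling mode" means for a perf_events counter: instead of reading one total at the end, the kernel delivers a sample every sample_period events, which is what lets a scheduler observe behavior as it changes.

#include <linux/perf_event.h>
#include <string.h>

void init_sampling_attr(struct perf_event_attr *attr, unsigned long long raw_event) {
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_RAW;              /* raw code of the memory-traffic event */
    attr->config = raw_event;                /* event code looked up as shown earlier */
    attr->sample_period = 10000;             /* one sample every 10000 events */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr->disabled = 1;                      /* enable later via ioctl(fd, PERF_EVENT_IOC_ENABLE) */
    attr->exclude_kernel = 1;
    /* samples are read from a ring buffer obtained by mmap()-ing the perf fd */
}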
Evaluation
Workload with 8 processes
lbm, soplex, gromacs, hmmer from SPEC CPU 2006
Two instances of each program
Experimental results
78
Evaluation
[Chart, slides 79-81: execution time relative to solo execution for lbm, soplex, gromacs, hmmer, and their average, under the default Linux scheduler and under Membus; callouts mark improvements of 16% and 8%]
81
Summary: multicore processors
Resource sharing critical for performance
Membus: a scheduler that reduces resource sharing
Question: why wasn’t Membus able to improve more?
82
Memory controller sharing
[Figure: the soplex and namd instances are distributed over the cores of both processors, but both processors still reach RAM through a single shared memory controller]
83
Non-uniform memory architecture
[Figure, slides 84-85: in the original design both processors reach RAM through a shared bus controller and a single memory controller; in the non-uniform memory architecture each processor gets its own memory controller with local RAM, and the processors are linked by an interconnect]
85
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
86
Non-uniform memory architecture
[Figure: processor 0 with cores 0-3 and processor 1 with cores 4-7; each processor has a memory controller (MC) attached to its local DRAM, and the two processors are connected by an interconnect (IC)]
87
Non-uniform memory architecture
[Figure: thread T0 on processor 0 accesses data in its local DRAM through the local memory controller]
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])
88
Non-uniform memory architecture
[Figure: thread T0 on processor 0 accesses data in processor 1's DRAM across the interconnect]
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles
Key to good performance: data locality
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])
89
Data locality in multithreaded programs
[Chart, slides 90-91: remote memory references as a fraction of all memory references (0-60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C]
91
First-touch page placement policy
[Figure, slides 92-93: threads T0 and T1 run on processor 0 and processor 1; page P0 ends up in processor 0's DRAM because T0 touches it first, and page P1 ends up in processor 1's DRAM because T1 touches it first]
93
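Not from the slides; a minimal C sketch of how a program can cooperate with the first-touch policy: each thread initializes (first-touches) its own partition of the array, so those pages end up in the DRAM local to the processor that thread runs on. (In practice the threads would also be pinned to their cores, as in the affinity sketch further below.)

#include <pthread.h>
#include <stdlib.h>

#define N (1 << 24)
#define NTHREADS 8

static double *data;

static void *init_and_work(void *arg) {
    long t = (long)arg;
    long chunk = N / NTHREADS;
    /* the first write is the "first touch": it places the page near this thread */
    for (long i = t * chunk; i < (t + 1) * chunk; i++)
        data[i] = 0.0;
    /* ... later accesses by this thread to its partition stay local ... */
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    data = malloc(N * sizeof(double));   /* reserves memory; pages are placed on first touch */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, init_and_work, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    free(data);
    return 0;
}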
Automatic page placement
First-touch page placement
Often high number of remote accesses
Data address profiling
Profile-based page placement
Supported by hardware performance counters on many architectures
94
Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]
[Figure: profile collected for threads T0 (processor 0) and T1 (processor 1): page P0 accessed 1000 times by T0, page P1 accessed 3000 times by T1; each page is then placed in the DRAM of the processor whose thread accesses it most]
95
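Not the authors' tool; a small sketch of how such a placement decision can be enacted on Linux with move_pages() from libnuma (link with -lnuma): given the node whose thread accesses a page most, migrate that page there.

#include <numaif.h>
#include <stdio.h>

/* Hypothetical helper: migrate one page of the calling process to the given NUMA node. */
static int place_page_on_node(void *page_addr, int node) {
    void *pages[1]  = { page_addr };
    int   nodes[1]  = { node };
    int   status[1] = { -1 };

    long ret = move_pages(0 /* calling process */, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (ret < 0 || status[0] < 0) {
        fprintf(stderr, "page migration failed (status %d)\n", status[0]);
        return -1;
    }
    return 0;
}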
Automatic page placement
Compare: first-touch and profile-based page placement
Machine: 2-processor 8-core Intel Xeon E5520
Subset of NAS PB: programs with high fraction of remote accesses
8 threads with fixed thread-to-core mapping
96
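The fixed thread-to-core mapping mentioned above can be done with the standard Linux affinity call; a minimal sketch (not the evaluation scripts themselves):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one core (0-7 on the 8-core machine above). */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0 /* calling thread */, sizeof(set), &set);
}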
Profile-based page placement
[Chart, slides 97-98: performance improvement over first-touch placement (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B]
98
Profile-based page placement
Performance improvement over first-touch in some cases
No performance improvement in many cases
Why?
99
Inter-processor data sharing
[Figure, slides 100-101: profile for threads T0 and T1: P0 accessed 1000 times by T0, P1 accessed 3000 times by T1, P2 accessed 4000 times by T0 and 5000 times by T1; P2 is inter-processor shared, so wherever it is placed, some of its accesses are remote]
101
Inter-processor data sharing
[Chart, slides 102-103: inter-processor shared heap as a fraction of the total heap (0-60%) for cg.B, lu.C, bt.B, ft.B, sp.B]
103
Inter-processor data sharing
[Chart, slides 104-105: for cg.B, lu.C, bt.B, ft.B, sp.B, the inter-processor shared heap fraction (0-60%) shown next to the performance improvement over first-touch (0-30%); benchmarks with a large shared fraction see little improvement]
105
Automatic page placement
Profile-based page placement often ineffective
Reason: inter-processor data sharing
Inter-processor data sharing is a program property
We propose program transformations
No time for details now, see results
106
Evaluation
[Chart, slides 107-108: performance improvement over first-touch (0-25%) for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation with program transformations]
108
Conclusions
Performance optimizations
Good algorithm + hardware-awareness
Example: cache-aware matrix multiplication
Hardware awareness
Resource sharing in multicore processors
Data placement in non-uniform memory architectures
A lot remains to be done...
...and you can be part of it!
109
ETH scholarship for masters students...
...to work on their master thesis
In the Laboratory of Software Technology
Prof. Thomas R. Gross
PhD. Stanford University, MIPS project, supervisor John L. Hennessy
Carnegie Mellon: Warp, iWarp, Fx projects
ETH offers to you
Monthly scholarship of CHF 1500–1700 (EUR 1200–1400)
Assistance with finding housing
Thesis topic
110
Possible Topics
Michael Pradel: Automatic bug finding
Luca Della Toffola: Performance optimizations for Java
Me: Hardware-aware performance optimizations
111
OO code positioning
[Figure: a call graph over methods A, B, C, D, E, their layout in memory, and the resulting cache contents]
[Figure: profiling the call graph identifies the hot path]
[Figure: with the default memory layout (... A B C D E ...), executing the hot path causes a cache miss]
[Figure: after repositioning the code in memory (... A B E D C ...), the hot path A, B, E fits in the cache and hits]
Topic aspects: JVM, no profiling, constructors
Example: linked-list traversal, looking for the youngest/oldest person
[Figure: a list of Person objects, each with fields next, name, surname, age]
[Figure: when whole objects are cached, each cache fill brings in next, name, surname, and age, so only a few list nodes fit in the cache]
[Figure: if only the fields used by the traversal (next, age) are kept together, about twice as many list nodes fit in the cache]
Jikes RVM
Splitting strategies
Garbage collection optimizations
Allocation optimizations
Splitting based on # field accesses
[Figure: class A with fields a1-a5; profiling counts accesses (a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000); splitting keeps the hot fields a3 and a5 in A and moves the cold fields a1, a2, a4 into a separate class A$Cold]
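A C illustration (the actual topic targets Java classes in Jikes RVM) of the same hot/cold splitting idea: the traversal from the linked-list example only needs next and age, so keeping those together lets each cache line hold more list nodes.

#include <stddef.h>
#include <limits.h>

struct person_cold {              /* rarely accessed ("cold") fields */
    const char *name;
    const char *surname;
};

struct person {                   /* frequently accessed ("hot") fields */
    struct person *next;
    int age;
    struct person_cold *cold;     /* reference to the cold part */
};

/* Finding the youngest person touches only the hot fields. */
static int youngest_age(const struct person *head) {
    int min_age = INT_MAX;
    for (const struct person *p = head; p != NULL; p = p->next)
        if (p->age < min_age)
            min_age = p->age;
    return min_age;
}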
If interested and motivated
Apply with Prof. Rodica Potolea
Until August 2012
Come to Zurich
Start in February 2013
Work 4-6 months on the thesis
If you have questions
Send e-mail to me [email protected]
Talk to Prof. Rodica Potolea
121
Thank you for your attention!
122