
Performance Optimizations
for NUMA-Multicore Systems
Zoltán Majó
Department of Computer Science
ETH Zurich, Switzerland
About me
 ETH Zurich: research assistant
  Research: performance optimizations
  Assistant: lectures
 TUCN
  Student
  Communications Center: network engineer
  Department of Computer Science: assistant
2
Computing
Unlimited need for performance
3
Performance optimizations
 One goal: make programs run fast
 Idea: pick good algorithm
 Reduce number of operations executed
 Example: sorting
4
Sorting
 [Chart] Execution time [T] (= number of operations, 0–180) vs. input size (n = 1–12) for an n^2 algorithm and an n*log(n) algorithm; at n = 12 the n*log(n) algorithm is ~11X faster
Sorting
 We picked a good algorithm, so the work is done
 Are we really done?
 We still have to make sure our algorithm runs fast
 Operations take time
 We assumed 1 operation = 1 time unit T
9
Quicksort performance
 [Chart] Measured execution time [T] (0–200) vs. input size (n = 1–12) for quicksort; the annotation marks a 32% speedup
Latency of operations
 Best algorithm not enough
 Operations are executed on hardware
 [Diagram] CPU pipeline: Stage 1: Dispatch operation → Stage 2: Execute operation → Stage 3: Retire operation
 Hardware must be used efficiently
Outline
 Introduction: performance optimizations
 Cache-aware programming
 Scheduling on multicore processors
 Using run-time feedback
 Data locality optimizations on NUMA-multicores
 Conclusion
 ETH scholarship
17
Memory accesses
 [Diagram] CPU accesses RAM directly; 230 cycles access latency per access
 Total access latency for 16 accesses = 16 x 230 cycles = 3680 cycles
Caching
 [Diagram] A cache is inserted between the CPU and RAM: 30 cycles access latency to the cache, 200 cycles to RAM; data is transferred from RAM to the cache in blocks
Hits and misses
 Cache miss: data not in cache = 230 cycles (30 cycles cache + 200 cycles RAM)
 Cache hit: data in cache = 30 cycles
26
Total access latency
 Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles
27
Benefits of caching
 Comparison
  Architecture w/o cache: T = 230 cycles per access
  Architecture w/ cache: T_avg = 1280 cycles / 16 accesses = 80 cycles → 2.7X improvement
 Do caches always help?
  Can you think of an access pattern with bad cache usage? (one candidate is sketched below)
28
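One possible answer, as a minimal sketch (not from the slides; N, STRIDE, and the 64-byte block size are illustrative assumptions): both loops below perform the same number of additions, but the strided loop touches a new cache block on almost every access.

#include <stdio.h>

#define N (1 << 24)        /* 16M ints */
#define STRIDE 16          /* 16 ints = 64 bytes, one typical cache block */

static int data[N];

int main(void) {
    long sum = 0;

    /* Cache-friendly: consecutive elements share a cache block. */
    for (int i = 0; i < N; i++)
        sum += data[i];

    /* Cache-unfriendly: each access lands in a different cache block. */
    for (int s = 0; s < STRIDE; s++)
        for (int i = s; i < N; i += STRIDE)
            sum += data[i];

    printf("%ld\n", sum);
    return 0;
}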
Caching
 [Diagram] CPU ↔ Cache (35 cycles access latency) ↔ RAM (200 cycles access latency)
29
Cache-aware programming
 Today’s example: matrix-matrix multiplication (MMM)
 Number of operations: n^3
 Compare naïve and optimized implementations
 Same number of operations
30
MMM: naïve implementation
 [Diagram] C[i][j] = row i of A x column j of B

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    sum = 0.0;
    for (k = 0; k < N; k++)
      sum += A[i][k] * B[k][j];
    C[i][j] = sum;
  }
31
MMM (ijk): cache behavior
 [Diagram] Out of 4 consecutive accesses to A[i][k] (row-wise), 3 hit in the cache; out of 4 consecutive accesses to B[k][j] (column-wise), 0 hit
MMM: Cache performance
 Hit rate
 Accesses to A[][]: 3/4 = 75%
 Accesses to B[][]: 0/4 = 0%
 All accesses: 3/8 ≈ 38%
 Can we do better?
37
Cache-friendly MMM
Cache-unfriendly MMM (ijk):

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    sum = 0.0;
    for (k = 0; k < N; k++)
      sum += A[i][k] * B[k][j];
    C[i][j] += sum;
  }

Cache-friendly MMM (ikj):

for (i = 0; i < N; i++)
  for (k = 0; k < N; k++) {
    r = A[i][k];
    for (j = 0; j < N; j++)
      C[i][j] += r * B[k][j];
  }

 [Diagram] In the ikj order, both C[i][j] and B[k][j] are traversed row-wise
 (A compilable version of both loop orders is sketched below.)
38
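For completeness, a minimal compilable sketch of both loop orders (not from the slides; the matrix size, random initialization, and timing harness are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024

static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void mmm_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

static void mmm_ikj(void) {
    /* C must start at zero because it is accumulated row by row. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += r * B[k][j];
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX;
            B[i][j] = (double)rand() / RAND_MAX;
        }

    double t0 = seconds();
    mmm_ijk();
    double t1 = seconds();
    mmm_ikj();
    double t2 = seconds();

    printf("ijk: %.2f s, ikj: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}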
MMM (ikj): cache behavior
 [Diagram] Out of 4 consecutive accesses to C[i][j], 3 hit in the cache; out of 4 consecutive accesses to B[k][j], 3 hit (both row-wise)
39
Cache-friendly MMM
 Cache-unfriendly MMM (ijk): A[][]: 3/4 = 75% hit rate; B[][]: 0/4 = 0% hit rate; all accesses: 38% hit rate
 Cache-friendly MMM (ikj):   C[][]: 3/4 = 75% hit rate; B[][]: 3/4 = 75% hit rate; all accesses: 75% hit rate
 Better performance due to cache-friendliness?
40
Performance of MMM
 [Chart] Execution time [s] (log scale, 0.01–10000) vs. matrix size (512–8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ikj is about 20X faster
Cache-aware programming
 Two versions of MMM: ijk and ikj
 Same number of operations (~n^3)
 ikj 20X better than ijk
 Good performance depends on two aspects
 Good algorithm
 Implementation that takes hardware into account
 Hardware
 Many possibilities for inefficiencies
 We consider only the memory system in this lecture
43
Outline
 Introduction: performance optimizations
 Cache-aware programming
 Scheduling on multicore processors
 Using run-time feedback
 Data locality optimizations on NUMA-multicores
 Conclusions
 ETH scholarship
44
Cache-based architecture
 [Diagram] CPU → L1 cache (10 cycles access latency) → L2 cache (20 cycles) → bus controller → memory controller → RAM (200 cycles)
45
Multi-core multiprocessor
 [Diagram] Two processor packages, four cores each; every core has a private L1 cache, every pair of cores shares an L2 cache; the packages connect through bus controllers to a single memory controller and RAM
46
Experiment
 Performance of a well-optimized program
  soplex from SPEC CPU 2006
 Multicore-multiprocessor systems are parallel
  Multiple programs run on the system simultaneously
  Contender program: milc from SPEC CPU 2006
 Examine 4 execution scenarios
47
Execution scenarios
 [Diagram] soplex and milc are placed on different cores of the two processors so that they share either the L2 cache, the bus controller, and the memory controller, or only a subset of these resources
Performance with sharing: soplex
 [Chart] Execution time relative to solo execution (0.0–2.0) in four scenarios: solo; shared cache, bus controller, and memory controller; shared bus controller and memory controller; shared memory controller
52
Resource sharing
 Significant slowdowns due to resource sharing
 Why is resource sharing so bad?
Example: cache sharing
53
Cache sharing
 [Diagram] soplex and milc run on two cores that share a cache; the two programs compete for the shared cache capacity
55
Resource sharing
 Does resource sharing affect all programs?
 So far: we considered the performance of soplex under contention
 Let us consider a different program: namd
56
Performance with sharing
 [Chart] Execution time relative to solo execution (0.0–2.0) for soplex and namd in the same four sharing scenarios; soplex slows down substantially, namd much less
58
Resource sharing
 Significant slowdown for some programs
  soplex: affected significantly
  namd: affected less
 What do we do about it?
  Scheduling can help
  Example workload: 4 x soplex + 4 x namd
59
Execution scenarios
 [Diagram] Two possible placements of the 4 soplex and 4 namd processes onto the cores, caches, and processors of the machine
Challenges for a scheduler
 Programs have different behaviors (soplex vs. namd)
 Behavior not known ahead of time
 Behavior changes over time
62
Single-phased program
63
Program with multiple phases
64
Outline
 Introduction: performance optimizations
 Cache-aware programming
 Scheduling on multicore processors
 Using run-time feedback
 Data locality optimizations on NUMA-multicores
 Conclusions
 ETH scholarship
65
Hardware performance counters
 Special registers
 Programmable to monitor given hardware event (e.g., cache misses)
 Low-level information about hardware-software interaction
 Low overhead due to hardware implementation
 In the past: undocumented feature
 Since Intel Pentium: publicly available description
 Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
66
Programming performance counters
 Model-specific registers
 Access: RDMSR, WRMSR, and RDPMC instructions
 Ring 0 instructions (available only in kernel-mode)
 perf_events interface
 Standard Linux interface since Linux 2.6.31
 UNIX philosophy: performance counters are files
 Simple API:
 Set up counters: perf_event_open()
 Read counters as files
67
Example: monitoring cache misses
int main() {
  int pid = fork();
  if (pid == 0) {
    exit(execl("./my_program", "./my_program", NULL));
  } else {
    int status; uint64_t value;
    int fd = perf_event_open(...);
    waitpid(pid, &status, 0);
    read(fd, &value, sizeof(uint64_t));
    printf("Cache misses: %"PRIu64"\n", value);
  }
}
68
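The slide's code elides the perf_event_open() arguments. Below is a minimal sketch of a complete version (my assumption, not the original code): perf_event_open() has no glibc wrapper, so it is called through syscall(), and the generic PERF_COUNT_HW_CACHE_MISSES event and the ./my_program path are placeholders.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        execl("./my_program", "./my_program", (char *)NULL);
        _exit(1);
    }

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* generic cache-miss event */
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, pid, -1, -1, 0);   /* monitor the child */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    int status;
    waitpid(pid, &status, 0);

    uint64_t value;
    read(fd, &value, sizeof(value));
    printf("Cache misses: %" PRIu64 "\n", value);
    return 0;
}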
perf_event_open()
 Looks simple:

int sys_perf_event_open(
    struct perf_event_attr *hw_event_uptr,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
);

 ...but struct perf_event_attr has many fields:

struct perf_event_attr {
    __u32 type;
    __u32 size;
    __u64 config;
    union {
        __u64 sample_period;
        __u64 sample_freq;
    };
    __u64 sample_type;
    __u64 read_format;
    __u64 inherit;
    __u64 pinned;
    __u64 exclusive;
    __u64 exclude_user;
    __u64 exclude_kernel;
    __u64 exclude_hv;
    __u64 exclude_idle;
    __u64 mmap;
    /* ... */
};
69
libpfm
 Open-source helper library
 [Diagram] (1) the user program passes an event name to libpfm, (2) libpfm sets up the perf_event_attr structure, (3) the user program calls perf_event_open(), and (4) reads the results from perf_events
70
Example: measure cache misses for MMM
 Determine microarchitecture
 Intel Xeon E5520: Nehalem microarchitecture
 Look up event needed
 Source: Intel Architectures Software Developer's Manual
71
Software Developer’s Manual
72
Example: measure cache misses for MMM
 Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
73
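As a sketch of how libpfm fits in (assuming libpfm4, linked with -lpfm; not code from the slides), the symbolic event name from above can be translated into a perf_event_attr like this:

#include <stdio.h>
#include <string.h>
#include <perfmon/pfmlib.h>
#include <perfmon/pfmlib_perf_event.h>

int encode_event(const char *name, struct perf_event_attr *attr) {
    if (pfm_initialize() != PFM_SUCCESS)
        return -1;

    pfm_perf_encode_arg_t arg;
    memset(&arg, 0, sizeof(arg));
    memset(attr, 0, sizeof(*attr));
    arg.attr = attr;
    arg.size = sizeof(arg);

    /* Steps (1)+(2): libpfm fills in attr->type and attr->config for the event. */
    if (pfm_get_os_event_encoding(name, PFM_PLM3, PFM_OS_PERF_EVENT, &arg)
            != PFM_SUCCESS)
        return -1;
    return 0;
}

int main(void) {
    struct perf_event_attr attr;
    if (encode_event("OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM", &attr) < 0) {
        fprintf(stderr, "could not encode event\n");
        return 1;
    }
    /* Step (3): attr can now be passed to perf_event_open(), as in the example above. */
    printf("type=%u config=0x%llx\n", attr.type, (unsigned long long)attr.config);
    return 0;
}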
MMM cache misses
 [Chart] Number of cache misses (x 10^6, log scale) vs. matrix size (512–8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ikj causes roughly 30X fewer misses
74
Single-phased program
 [Diagram] Set up performance counters once at program start, read them once at the end
75
Program with multiple phases
 [Diagram] Set up performance counters at program start, then take samples repeatedly as the program's behavior changes
76
Membus: multicore scheduler
1. Dynamically determine program behavior
 Measure # of loads/stores that cause memory traffic
 Hardware performance counters in sampling mode
2. Determine optimal placement based on measurements
77
Evaluation
 Workload with 8 processes
 lbm, soplex, gromacs, hmmer from SPEC CPU 2006
 Two instances of each program
 Experimental results
78
Evaluation
 [Chart] Execution time relative to solo execution (0.0–3.0) for lbm, soplex, gromacs, hmmer, and the average, under the default Linux scheduler and under Membus; the annotations mark improvements of 16% and 8%
81
Summary: multicore processors
 Resource sharing critical for performance
 Membus: a scheduler that reduces resource sharing
 Question: why wasn’t Membus able to improve more?
82
Memory controller sharing
 [Diagram] Even with a good placement of the soplex and namd processes across cores and L2 caches, all of them still share the single memory controller and RAM
83
Non-uniform memory architecture
 [Diagram] Each processor gets its own memory controller and its own local RAM; the two processors are linked by an interconnect, so each can also reach the other's RAM remotely
85
Outline
 Introduction: performance optimizations
 Cache-aware programming
 Scheduling on multicore processors
 Using run-time feedback
 Data locality optimizations on NUMA-multicores
 Conclusions
 ETH scholarship
86
Non-uniform memory architecture
 [Diagram] Processor 0 (cores 0–3) and Processor 1 (cores 4–7), each with a memory controller (MC) and local DRAM, connected by an interconnect (IC)
87
Non-uniform memory architecture
 Local memory accesses: a thread on Processor 0 accesses data in Processor 0's DRAM through the local memory controller
  bandwidth: 10.1 GB/s, latency: 190 cycles
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09]) 88
Non-uniform memory architecture
 Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
 Remote memory accesses (through the interconnect to the other processor's DRAM): bandwidth 6.3 GB/s, latency 310 cycles
 Key to good performance: data locality (a small experiment to observe this is sketched below)
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09]) 89
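A small way to observe this asymmetry yourself, as a sketch (assuming libnuma is available, linked with -lnuma, and the machine has at least two NUMA nodes; node numbers and buffer size are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

#define SIZE (256UL * 1024 * 1024)   /* 256 MB, much larger than the caches */

int main(int argc, char **argv) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int mem_node = (argc > 1) ? atoi(argv[1]) : 0;

    numa_run_on_node(0);                             /* execute on node 0      */
    char *buf = numa_alloc_onnode(SIZE, mem_node);   /* data lives on mem_node */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    for (unsigned long i = 0; i < SIZE; i += 4096)   /* fault in the pages first */
        buf[i] = 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long sum = 0;
    for (unsigned long i = 0; i < SIZE; i += 64)     /* one access per cache line */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("memory on node %d: %.3f s (checksum %ld)\n", mem_node, s, sum);

    numa_free(buf, SIZE);
    return 0;
}

Running it once with argument 0 (local) and once with 1 (remote) should show the gap described above.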
Data locality in multithreaded programs
 [Chart] Remote memory references / total memory references [%] (0–60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C; several benchmarks have a high fraction of remote accesses
91
First-touch page placement policy
 [Diagram] Threads T0 (Processor 0) and T1 (Processor 1): each page is placed in the DRAM of the processor whose thread first reads or writes it, so page P0 (first touched by T0) ends up in Processor 0's DRAM and page P1 (first touched by T1) in Processor 1's DRAM
93
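The practical consequence for multithreaded code, as a minimal sketch (assuming OpenMP and a Linux-style first-touch policy; not from the slides): each thread should initialize the data it will later work on, so that the pages are first touched, and therefore placed, on its own node.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (32L * 1024 * 1024)

int main(void) {
    double *a = malloc(N * sizeof(double));

    /* Parallel initialization: thread t first-touches its own chunk,
     * so those pages end up in its local DRAM. If the master thread
     * initialized everything, all pages would land on one node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Later computation with the same static schedule reuses the same
     * thread-to-chunk mapping, so most accesses stay node-local. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("%f\n", sum);
    free(a);
    return 0;
}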
Automatic page placement
 First-touch page placement
  Often results in a high number of remote accesses
 Data address profiling
  Enables profile-based page placement
  Supported by hardware performance counters on many architectures
94
Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]
 [Diagram] Profile: page P0 accessed 1000 times by T0, page P1 accessed 3000 times by T1; each page is placed in the DRAM of the processor whose thread accesses it most
95
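Mechanically, a profile-guided placer on Linux could migrate pages with the move_pages() system call. A minimal sketch (assuming libnuma's <numaif.h>, linked with -lnuma; the single buffer and target node 1 stand in for what a real profile would provide):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);

    /* Stand-in for profiled data: one page-aligned, already-touched page. */
    char *buf = aligned_alloc(page_size, page_size);
    buf[0] = 1;                        /* first touch places it somewhere */

    void *pages[1] = { buf };
    int nodes[1]   = { 1 };            /* profile says: most accesses from node 1 */
    int status[1];

    /* pid 0 = calling process; MPOL_MF_MOVE migrates the page. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page %p is now on node %d\n", (void *)buf, status[0]);

    free(buf);
    return 0;
}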
Automatic page placement
 Compare: first-touch and profile-based page placement
 Machine: 2-processor 8-core Intel Xeon E5520
 Subset of NAS PB: programs with high fraction of remote accesses
 8 threads with fixed thread-to-core mapping
96
Profile-based page placement
 [Chart] Performance improvement over first-touch [%] (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B; some benchmarks improve noticeably, others barely at all
98
Profile-based page placement
 Performance improvement over first-touch in some cases
 No performance improvement in many cases
 Why?
99
Inter-processor data sharing
 [Diagram] Profile: P0 accessed 1000 times by T0; P1 accessed 3000 times by T1; P2 accessed 4000 times by T0 and 5000 times by T1. P2 is inter-processor shared: wherever it is placed, one processor's accesses to it remain remote
101
Inter-processor data sharing
 [Chart] Inter-processor shared heap relative to total heap [%] (0–60%) for cg.B, lu.C, bt.B, ft.B, sp.B
103
Inter-processor data sharing
 [Chart] For cg.B, lu.C, bt.B, ft.B, sp.B: inter-processor shared heap relative to total heap [%] (0–60%) shown next to the performance improvement over first-touch [%] (0–30%); benchmarks with a large shared heap gain little from profile-based placement
105
Automatic page placement
 Profile-based page placement often ineffective
 Reason: inter-processor data sharing
 Inter-processor data sharing is a program property
 We propose program transformations
 No time for details now, see results
106
Evaluation
 [Chart] Performance improvement over first-touch [%] (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation with the proposed program transformations
108
Conclusions
 Performance optimizations
 Good algorithm + hardware-awareness
 Example: cache-aware matrix multiplication
 Hardware awareness
 Resource sharing in multicore processors
 Data placement in non-uniform memory architectures
 A lot remains to be done...
 ...and you can be part of it!
109
ETH scholarship for master's students...
 ...to work on their master's thesis
In the Laboratory of Software Technology
Prof. Thomas R. Gross
 PhD from Stanford University (MIPS project, advisor John L. Hennessy)
 Carnegie Mellon: Warp, iWarp, Fx projects
 ETH offers to you
  Monthly scholarship of CHF 1500–1700 (EUR 1200–1400)
 Assistance with finding housing
 Thesis topic
110
Possible Topics
Michael Pradel: Automatic bug finding
Luca Della Toffola: Performance optimizations for Java
Me: Hardware-aware performance optimizations
111
OO code positioning
 [Diagram] A call graph of methods A–E and their layout in memory and in the cache: when the methods on the hot path are scattered in memory, executing the path causes cache misses; profiling identifies the hot path so that its methods can be positioned next to each other in memory and hit in the cache
 Topic: do the positioning inside the JVM, without profiling, based on constructor information
 Linked list traversal: looking for the youngest/oldest person
 [Diagram] Each Person node has the fields next, name, surname, and age; the traversal only needs next and age
 [Diagram] Cache contents: with whole Person objects, each cache line holds only a few (next, name, surname, age) groups; if only the hot fields (next, age) are laid out together, far more nodes fit in the cache
 Jikes RVM
 Splitting strategies
 Garbage collection optimizations
 Allocation optimizations
 [Diagram] Field-access profiling of class A (a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000 accesses) and class splitting: the hot fields a3 and a5 stay in A, the cold fields a1, a2, and a4 move to a separate class A$Cold (a C sketch of the idea follows below)
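Rendered in C for illustration (the slide's context is Java and the Jikes RVM; the struct and function names here are my own): the hot fields stay in the node that the traversal touches, the cold fields move behind a pointer.

#include <stddef.h>

struct person_cold {        /* cold fields: rarely accessed */
    char name[32];
    char surname[32];
};

struct person {             /* hot fields: touched on every traversal step */
    struct person *next;
    int age;
    struct person_cold *cold;   /* indirection to the cold part */
};

/* Traversal that benefits: only hot fields are read, so more nodes fit per cache line. */
int youngest(const struct person *head) {
    int min_age = 1 << 30;
    for (const struct person *p = head; p != NULL; p = p->next)
        if (p->age < min_age)
            min_age = p->age;
    return min_age;
}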
If interested and motivated
 Apply
  With Prof. Rodica Potolea
  Until August 2012
 Come to Zurich
  Start in February 2013
  Work 4–6 months on the thesis
 If you have questions
  Send an e-mail to me: [email protected]
  Talk to Prof. Rodica Potolea
121
Thank you for your attention!
122