Memory System Performance in a NUMA Multicore Multiprocessor


Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
NUMA multicores

[Diagram: two processors, each with several cores sharing a cache, a memory controller (MC), and an interconnect (IC) link to the other processor; each processor has locally attached DRAM memory]
NUMA multicores

Two problems:
• NUMA: interconnect overhead
• multicore: cache contention

[Diagram: processes A and B with their memory MA and MB in Processor 0's DRAM; if B runs on Processor 1 it reaches MB across the interconnect (interconnect overhead), and if B runs on Processor 0 it shares the cache with A (cache contention)]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Multi-clone experiments

• Memory behavior of unrelated programs
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
  – local clones
  – remote clones

[Diagram: the four clones (C) are placed on the cores of Processor 0 and Processor 1; the memory (M) of all clones resides in Processor 0's DRAM, so clones running on Processor 1 are remote clones]
[Figure: the five possible schedules of the four clones. Schedule 1 runs all clones on Processor 0 (local bandwidth 100%); schedules 2, 3, and 4 move one, two, and three clones to Processor 1 (local bandwidth 80%, 57%, and 32%); schedule 5 runs all clones on Processor 1 (local bandwidth 0%). The clones' memory always stays in Processor 0's DRAM]
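The five schedules can be reproduced with standard Linux tooling. The following is a minimal sketch (my reconstruction, not the authors' scripts), assuming numactl is available and using a placeholder soplex command line: every clone binds its memory to node 0, and the schedule only chooses how many clones run on node 1.

    # Sketch of the multi-clone experiment: the memory of all clones stays on
    # node 0, n_remote clones run on node 1 (remote clones), the rest on node 0
    # (local clones). The benchmark command line is a placeholder.
    import subprocess

    def run_schedule(n_remote, benchmark=("./soplex", "input.mps")):
        clones = []
        for i in range(4):
            node = 1 if i < n_remote else 0
            clones.append(subprocess.Popen(
                ["numactl", "--membind=0", f"--cpunodebind={node}", *benchmark]))
        for c in clones:
            c.wait()

    # Schedules 1-5 from the figure correspond to n_remote = 0, 1, 2, 3, 4.
    # run_schedule(n_remote=2)   # schedule 3: two local and two remote clones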
Performance of schedules

• Which is the best schedule?
• Baseline: single-program execution mode

[Diagram: baseline configuration with a single clone running on Processor 0 and its memory in Processor 0's DRAM]
Execution time

[Chart: slowdown relative to the single-program baseline for each schedule, shown for local clones, remote clones, and the average]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
N-MASS
(NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 1: Maximum-local mapping

[Diagram: four processes A, B, C, D and their memory; MA and MB reside in Processor 0's DRAM, MC and MD in Processor 1's DRAM. The maximum-local mapping runs each process on the processor that holds its memory: A and B on Processor 0, C and D on Processor 1]
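A minimal sketch of the idea behind Step 1 (my illustration; the data structures are assumed, not taken from N-MASS): since all memory of a process resides on one node, the maximum-local mapping simply runs each process on the node that holds its memory.

    # Step 1 (maximum-local mapping): place every process on the NUMA node
    # that holds its memory, so all of its accesses stay local.
    def maximum_local_mapping(memory_node):
        """memory_node: process id -> node holding its memory.
        Returns: process id -> node the process should run on."""
        return dict(memory_node)

    # Example from the diagram: MA, MB on Processor 0; MC, MD on Processor 1.
    mapping = maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1})
    # -> A and B run on Processor 0, C and D on Processor 1.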
Default OS scheduling

[Diagram: the default OS schedule places A, B, and C on Processor 0 and D on Processor 1, even though C's memory (MC) is in Processor 1's DRAM; C therefore accesses its memory across the interconnect]
Step 2: Cache-aware refinement

In an SMP: only cache contention matters, so the refinement can freely redistribute the processes across the two caches.

[Diagrams and chart: an SMP-style cache balancing moves processes between the processors without regard to where their memory is; the chart of per-process performance degradation for A, B, C, and D shows that, on a NUMA machine, the processes separated from their memory additionally suffer a NUMA penalty]
Step 2: Cache-aware refinement

In a NUMA system: the refinement must trade cache balancing against data locality, so a process is moved away from the processor holding its memory only if the resulting NUMA penalty stays within a NUMA allowance.

[Diagrams and chart: starting from the maximum-local mapping, the NUMA-aware refinement migrates a process to the other processor to relieve cache pressure; the chart of per-process performance degradation compares each migrated process's NUMA penalty with its NUMA allowance]
Performance factors

Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure:
   – local processes: misses per kilo-instruction (MPKI)
   – remote processes: MPKI × NUMA penalty
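The two factors can be turned into a simple load model. The sketch below is a simplified reading of how Step 2 might use them, not the published N-MASS algorithm: cache pressure is computed exactly as defined above, and one high-MPKI process is moved from the more loaded cache to the less loaded one, but only if its NUMA penalty stays within a fixed allowance. The allowance value and the greedy move rule are my assumptions.

    # Cache pressure as defined on the slide: MPKI for a local process,
    # MPKI x NUMA penalty for a remote one. The refinement rule below
    # (greedy move bounded by a NUMA allowance) is an assumption.
    def cache_pressure(mpki, numa_penalty, runs_local):
        return mpki if runs_local else mpki * numa_penalty

    def refine(mapping, memory_node, mpki, numa_penalty, allowance=1.2):
        """mapping, memory_node: pid -> node; mpki, numa_penalty: pid -> value."""
        def node_pressure(node):
            return sum(cache_pressure(mpki[p], numa_penalty[p],
                                      mapping[p] == memory_node[p])
                       for p in mapping if mapping[p] == node)

        hot, cold = sorted((0, 1), key=node_pressure, reverse=True)
        # Move the most cache-intensive process off the hot cache, but only
        # if running it remotely costs no more than the NUMA allowance.
        for p in sorted((q for q in mapping if mapping[q] == hot),
                        key=lambda q: mpki[q], reverse=True):
            if numa_penalty[p] <= allowance:
                refined = dict(mapping)
                refined[p] = cold
                return refined
        return mapping          # no admissible move: keep the mapping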
Implementation

• User-mode extension to the Linux scheduler
• Performance metrics
  – hardware performance counter feedback
  – NUMA penalty
    • perfect information from program traces
    • estimate based on MPKI
• All memory for a process allocated on one processor
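A sketch of what such a user-mode feedback loop could look like (my illustration, not the authors' implementation): sample hardware counters with perf to derive a process's MPKI, then adjust its CPU affinity. The event names, the one-second sampling window, and the core numbering of the second processor are assumptions.

    # Sample cache misses and instructions for a running process with
    # `perf stat`, compute MPKI, and pin the process to a set of cores.
    import os
    import subprocess

    def sample_mpki(pid, seconds=1):
        out = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "cache-misses,instructions",
             "-p", str(pid), "--", "sleep", str(seconds)],
            capture_output=True, text=True).stderr   # perf stat reports on stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return 1000.0 * counts["cache-misses"] / counts["instructions"]

    def place_on_processor(pid, cores):
        os.sched_setaffinity(pid, cores)              # user-mode placement

    # Example (assumed core numbering): move a process to the second processor.
    # if sample_mpki(pid) > threshold:
    #     place_on_processor(pid, {4, 5, 6, 7})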
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Workloads

• SPEC CPU2006 subset
• 11 multi-program workloads (WL1–WL11)
  – CPU-bound 4-program workloads (WL1–WL9)
  – memory-bound 8-program workloads (WL10, WL11)

[Chart: NUMA penalty of the selected SPEC CPU2006 programs]
Memory allocation setup

• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Memory allocation maps

An allocation map has one digit per process (A, B, C, D) giving the processor whose DRAM holds that process's memory.

[Diagrams: allocation map 0000 places MA, MB, MC, and MD all in Processor 0's DRAM (unbalanced); allocation map 0011 places MA and MB in Processor 0's DRAM and MC and MD in Processor 1's DRAM (balanced)]
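A minimal sketch of how an allocation map could be enforced when launching a workload (my illustration; the benchmark command lines are placeholders), using numactl's --membind to restrict each process's memory to the node named by its digit:

    # Launch one process per command, binding its memory to the NUMA node
    # given by the matching digit of the allocation map (e.g. "0011").
    import subprocess

    def launch_with_allocation_map(commands, allocation_map):
        return [subprocess.Popen(["numactl", f"--membind={node}", *cmd])
                for cmd, node in zip(commands, allocation_map)]

    # Allocation map 0011: A and B allocate on node 0, C and D on node 1.
    # launch_with_allocation_map([cmd_A, cmd_B, cmd_C, cmd_D], "0011")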
Evaluation

• Baseline: Linux average
  – the Linux scheduler is non-deterministic
  – average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information
WL9: Linux average
[Chart: average slowdown relative to single-program mode]

WL9: N-MASS
[Chart: average slowdown relative to single-program mode]

WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode]
N-MASS performance

• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
  – maximum-local mapping alone
  – N-MASS (maximum-local mapping + cache-aware refinement step)
Data-locality vs. cache balancing (WL9)
[Chart: performance improvement relative to Linux average]

Data-locality vs. cache balancing (WL1)
[Chart: performance improvement relative to Linux average]
Data locality vs. cache balancing

• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Estimating NUMA penalty

• The NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto the MPKI data

[Chart: NUMA penalty plotted against MPKI with the fitted regression line]
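A small sketch of the estimation idea (the calibration numbers are made up, not the talk's measurements): fit a linear model that predicts a program's NUMA penalty from its MPKI, and use the fitted line for programs whose penalty cannot be measured.

    # Fit NUMA penalty (slowdown factor of remote vs. local execution) as a
    # linear function of MPKI, then use the line to estimate unknown penalties.
    import numpy as np

    # Hypothetical calibration points: (MPKI, measured NUMA penalty).
    mpki    = np.array([0.5, 2.0, 5.0, 12.0, 25.0])
    penalty = np.array([1.02, 1.05, 1.12, 1.30, 1.55])

    slope, intercept = np.polyfit(mpki, penalty, 1)   # least-squares fit

    def estimate_numa_penalty(sample_mpki):
        return slope * sample_mpki + intercept

    print(estimate_numa_penalty(8.0))   # estimated slowdown at 8 MPKI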
Estimate-based N-MASS: performance
[Chart: performance improvement relative to Linux average]
Conclusions

• N-MASS: a NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling
Thank you! Questions?