Transcript Slide 1

Memory System Performance in a
NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science
ETH Zurich
1
Summary
• NUMA multicore systems are unfair to
local memory accesses
• Local execution sometimes suboptimal
2
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
3
NUMA multicores: how it happened
First generation: SMP
[Diagram: 8 cores (0-7) connected through bus controllers (BusC) to a Northbridge with a single memory controller (MC) and one DRAM memory.]
[Chart: total bandwidth [GB/s] (0-25) versus number of active cores (1-8) for the SMP system.]
4
NUMA multicores: how it happened
Next generation: NUMA
[Diagram: transitional step toward NUMA; the cores attach through bus controllers (BusC) and interconnects (IC), and the Northbridge now contains multiple memory controllers (MC) in front of one DRAM memory.]
[Chart: total bandwidth [GB/s] (0-25) versus active cores (1-8); SMP curve shown for comparison.]
5
NUMA multicores: how it happened
Next generation: NUMA
[Diagram: two processors, each with 4 cores, an on-chip memory controller (MC), and its own DRAM memory, connected by an interconnect (IC).]
[Chart: total bandwidth [GB/s] (0-25) versus active cores (1-8); NUMA (local) compared with SMP.]
6
NUMA multicores: how it happened
Next generation: NUMA
[Diagram as on the previous slide; the chart now adds the remote curve: total bandwidth [GB/s] (0-25) versus active cores (1-8) for SMP, NUMA (local), and NUMA (remote).]
7
Bandwidth sharing
• Frequent scenario: bandwidth shared between cores
• Sharing model for the Intel Nehalem
[Diagram: two quad-core processors (cores 0-3 and 4-7), each with a memory controller (MC) and local DRAM memory, connected by an interconnect (IC).]
8
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
9
Evaluation system
[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, a Global Queue, and a memory controller (MC) attached to local DRAM memory; the processors are connected by QPI links.]
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
10
Bandwidth sharing: local accesses
[Diagram: cores of Processor 0 access their local DRAM memory through the Global Queue and the local memory controller (MC).]
11
Bandwidth sharing: remote accesses
[Diagram: cores of Processor 1 access Processor 0's DRAM memory remotely, crossing the QPI link and passing through both processors' Global Queues.]
12
Bandwidth sharing: combined accesses
[Diagram: Processor 0's DRAM memory is accessed at the same time by local cores (Processor 0) and remote cores (Processor 1); the Global Queue arbitrates between the two streams.]
13
Global Queue
• Mechanism to arbitrate between different
types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
14
Benchmark program
• STREAM triad
for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}
• Multiple co-executing triad clones
15
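For reference, here is a self-contained version of the triad kernel from the slide above with a rough bandwidth measurement; the array size, scalar value, and gettimeofday-based timing are illustrative choices, not the authors' exact benchmark harness.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE   (64 * 1024 * 1024)   /* doubles per array; illustrative size */
#define SCALAR 3.0

int main(void)
{
    double *a = malloc(SIZE * sizeof(double));
    double *b = malloc(SIZE * sizeof(double));
    double *c = malloc(SIZE * sizeof(double));
    if (!a || !b || !c)
        return 1;

    /* Initialize the arrays (also faults in the pages before timing). */
    for (long i = 0; i < SIZE; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < SIZE; i++)          /* STREAM triad */
        a[i] = b[i] + SCALAR * c[i];
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    /* The triad touches three arrays: it reads b and c and writes a. */
    double gbytes = 3.0 * SIZE * sizeof(double) / 1e9;
    printf("triad bandwidth: %.2f GB/s\n", gbytes / secs);

    free(a);
    free(b);
    free(c);
    return 0;
}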
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones run on Processor 0; remote clones run on Processor 1
• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
[Diagram: clone placement on Processor 0 and Processor 1 for the example configurations; a placement sketch in code follows this slide.]
16
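A minimal sketch of how a single clone could be placed for these experiments: the process pins itself to a chosen core and allocates its arrays on NUMA node 0 (Processor 0) via libnuma. sched_setaffinity and numa_alloc_onnode are standard Linux/libnuma calls (link with -lnuma), but the core/node numbering, the pin_to_core helper, and the sizes are assumptions for illustration, not the authors' actual setup.

#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, cpu_set_t */
#include <numa.h>       /* numa_alloc_onnode, numa_free; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

#define SIZE   (64 * 1024 * 1024)
#define SCALAR 3.0

/* Pin the calling process to one core; a core on Processor 1 makes this a
   "remote" clone because its data live on node 0 (Processor 0). */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;   /* core to run on */

    if (numa_available() < 0 || pin_to_core(core) != 0)
        return 1;

    /* All arrays allocated on node 0, as in the experiments. */
    double *a = numa_alloc_onnode(SIZE * sizeof(double), 0);
    double *b = numa_alloc_onnode(SIZE * sizeof(double), 0);
    double *c = numa_alloc_onnode(SIZE * sizeof(double), 0);
    if (!a || !b || !c)
        return 1;

    for (long i = 0; i < SIZE; i++) {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    for (long i = 0; i < SIZE; i++)              /* STREAM triad */
        a[i] = b[i] + SCALAR * c[i];

    printf("clone finished on core %d\n", core);

    numa_free(a, SIZE * sizeof(double));
    numa_free(b, SIZE * sizeof(double));
    numa_free(c, SIZE * sizeof(double));
    return 0;
}

Launching several such clones concurrently with different core arguments reproduces configurations like (2L, 3R); running unmodified clones under numactl with --physcpubind and --membind would be an equivalent command-line route.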
GQ fairness: local accesses
[Diagram: clones C0-C3 on cores 0-3 of Processor 0 access the local DRAM memory through the Global Queue (GQ) and IMC; Processor 1 is idle.]
[Chart: total bandwidth [GB/s] (0-14) for configurations 1L-4L, broken down per core (Core 0-Core 3).]
17
GQ fairness: remote accesses
[Diagram: clones on cores 4-7 of Processor 1 access Processor 0's DRAM memory remotely across QPI.]
[Chart: total bandwidth [GB/s] (0-14) for configurations 1L-4L and 1R-4R, broken down per core (legend: Core 0-Core 3).]
18
Global Queue fairness
• The Global Queue is fair when there are
only local or only remote accesses in the system
• What about combined accesses?
19
GQ fairness: combined accesses
Execute clones in all possible configurations
[Table: number of local clones (0-4) versus number of remote clones (0-4); the cell (2L, 3R) marks 2 local and 3 remote clones.]
20
GQ fairness: combined accesses
Execute clones in all possible configurations
[Table: number of local clones (0-4) versus number of remote clones (0-4).]
21
GQ fairness: combined accesses
[Chart: total bandwidth [GB/s] (0-14) for benchmark configurations (4L, 0R) through (4L, 4R), split into the bandwidth of local clones and remote clones.]
22
GQ fairness: combined accesses
Execute clones in all possible configurations
[Table: number of local clones (0-4) versus number of remote clones (0-4).]
23
Combined accesses
[Chart: total bandwidth [GB/s] (0-16) for configurations (1L, 1R) through (4L, 1R), broken down into the remote clone and local clones 1-4.]
24
Combined accesses
• In configuration (4L, 1R) the remote clone gets
30% more bandwidth than a local clone
• Remote execution can be better than local
25
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
26
Bandwidth sharing model
bandwidth_total = (1 - α) × bandwidth_local + α × bandwidth_remote
[Diagram: one clone on Processor 0 (local accesses) and one clone on Processor 1 (remote accesses) share the bandwidth to Processor 0's DRAM memory through the Global Queue.]
27
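Read as a convex combination of the local-only and remote-only bandwidths (one plausible reading of the sharing model above), the formula is straightforward to evaluate; the short sketch below only restates it in code, with placeholder bandwidth values rather than measured numbers, and with the sharing factor α written as alpha.

#include <stdio.h>

/* Bandwidth sharing model: total bandwidth as a mix of the bandwidth seen
   with only local accesses and with only remote accesses, weighted by the
   sharing factor alpha. */
static double total_bandwidth(double bw_local, double bw_remote, double alpha)
{
    return (1.0 - alpha) * bw_local + alpha * bw_remote;
}

int main(void)
{
    /* Placeholder inputs, for illustration only. */
    double bw_local = 10.0, bw_remote = 6.0;    /* GB/s */

    for (int i = 0; i <= 5; i++) {
        double alpha = 0.1 * i;                 /* sharing factor 0.0 .. 0.5 */
        printf("alpha = %.1f -> total = %.2f GB/s\n",
               alpha, total_bandwidth(bw_local, bw_remote, alpha));
    }
    return 0;
}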
Sharing factor (α)
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
28
Contention affects sharing factor
[Diagram: a clone accessing Processor 0's DRAM memory across QPI while additional clones on Processor 0 act as local contenders.]
29
Contention affects sharing factor
[Chart: sharing factor (α), 0% to 50%, versus additional contention (+0L to +3L local clones).]
30
Combined accesses
[Chart, repeated from earlier: total bandwidth [GB/s] (0-16) for configurations (1L, 1R) through (4L, 1R), broken down into the remote clone and local clones 1-4.]
31
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution
becomes more favorable
32
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
33
The next generation
[Diagram: two processors with six cores each (cores 0-5 and 6-B), each with a level 3 cache, a Global Queue, and an IMC with local DRAM memory, connected by QPI links.]
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
34
The next generation
[Chart: total bandwidth [GB/s] (0-12) for benchmark configurations (1L, 1R) through (6L, 1R), broken down into the remote clone and local clones 1-6.]
35
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
36
Thank you! Questions?
37