Matching Memory Access Patterns and Data Placement for NUMA Systems
Zoltán Majó
Thomas R. Gross
Computer Science Department
ETH Zurich, Switzerland
Non-uniform memory architecture

[Figure: two processors, each with four cores, a memory controller (MC), an interconnect (IC), and locally attached DRAM. A thread T0 running on Processor 0 accesses data either in the local DRAM or, via the interconnect, in the remote DRAM of Processor 1.]

Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

Key to good performance: data locality

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09]).
Data locality in multithreaded programs

[Figure: bar chart of remote memory references as a fraction of total memory references (0-60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, and mg.C.]
Outline
Automatic page placement
Memory access patterns of matrix-based computations
Matching memory access patterns and data placement
Evaluation
Conclusions
Automatic page placement

Current OS support for NUMA: first-touch page placement
  Often results in a high number of remote accesses
Data address profiling
  Enables profile-based page placement
  Supported in hardware on many architectures
Profile-based page placement

Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Figure: thread T0 runs on Processor 0 and thread T1 on Processor 1. The profile records that page P0 is accessed 1000 times by T0 and page P1 is accessed 3000 times by T1; profile-based placement therefore puts P0 in the DRAM of Processor 0 and P1 in the DRAM of Processor 1.]
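The deck does not show how the profiled pages are actually moved; below is a minimal sketch of how profile-guided placement could be carried out on Linux with libnuma's move_pages(). The page_profile layout and the choose_node() helper are illustrative, not the authors' implementation.

    /* Sketch: migrate each profiled page to the NUMA node whose thread(s)
       access it most often. Link with -lnuma. */
    #include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */
    #include <stdio.h>

    #define NNODES 2

    struct page_profile {
        void *page_addr;          /* page-aligned address          */
        long  accesses[NNODES];   /* access counts per NUMA node   */
    };

    /* Pick the node that issued the most accesses to this page. */
    static int choose_node(const struct page_profile *p) {
        int best = 0;
        for (int n = 1; n < NNODES; n++)
            if (p->accesses[n] > p->accesses[best]) best = n;
        return best;
    }

    static void place_pages(struct page_profile *prof, unsigned long count) {
        for (unsigned long i = 0; i < count; i++) {
            void *pages[1]  = { prof[i].page_addr };
            int   nodes[1]  = { choose_node(&prof[i]) };
            int   status[1];
            /* pid 0: operate on the pages of the calling process. */
            if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
                perror("move_pages");
        }
    }

choose_node() simply picks the node with the most recorded accesses; pages accessed heavily from both processors have no good single placement, which is exactly the limitation the talk turns to next.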
Automatic page placement
Compare: first-touch and profile-based page placement
Machine: 2-processor 8-core Intel Xeon E5520
Subset of NAS PB: programs with high fraction of remote accesses
8 threads with fixed thread-to-core mapping
Profile-based page placement

[Figure: bar chart of performance improvement over first-touch placement (0-25%) for cg.B, lu.C, bt.B, ft.B, and sp.B.]
Inter-processor data sharing

[Figure: thread T0 runs on Processor 0 and thread T1 on Processor 1. The profile records: P0 accessed 1000 times by T0; P1 accessed 3000 times by T1; P2 accessed 4000 times by T0 and 5000 times by T1. P0 and P1 can be placed locally, but P2 is inter-processor shared.]
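A small illustrative helper (not from the talk) that makes the classification concrete: a page counts as inter-processor shared if threads running on more than one processor access it, as with P2 above.

    #include <stdbool.h>

    #define NTHREADS 8
    #define NPROCS   2

    /* Example mapping used in the talk: T0-T3 on Processor 0, T4-T7 on Processor 1. */
    static const int thread_to_proc[NTHREADS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

    static bool is_inter_processor_shared(const long accesses[NTHREADS]) {
        long per_proc[NPROCS] = { 0 };
        for (int t = 0; t < NTHREADS; t++)
            per_proc[thread_to_proc[t]] += accesses[t];

        int procs_touching = 0;
        for (int p = 0; p < NPROCS; p++)
            if (per_proc[p] > 0) procs_touching++;
        return procs_touching > 1;   /* e.g., P2: 4000 from T0, 5000 from T1 */
    }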
Inter-processor data sharing

[Figure: bar chart for cg.B, lu.C, bt.B, ft.B, and sp.B showing the inter-processor shared heap relative to the total heap (0-60%) together with the performance improvement over first-touch placement (0-30%).]
Automatic page placement

Profile-based page placement often ineffective
  Reason: inter-processor data sharing
Inter-processor data sharing is a program property

Detailed look: program memory access patterns
  Loop-parallel programs with OpenMP-like parallelization
  Matrix processing
  NAS BT
Matrix processing

Matrix m[NX][NY] (NX rows, NY columns), processed sequentially:

for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]
Matrix processing

Matrix m[NX][NY], processed x-wise parallel: the rows are divided among threads T0-T7.

#pragma omp parallel for
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]
Thread scheduling

Remember: fixed thread-to-core mapping

[Figure: threads T0-T3 run on Processor 0 and threads T4-T7 on Processor 1, each processor with its local DRAM.]
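The slides do not show how the fixed mapping is established; one common way, sketched below under the assumption that thread t should simply run on core t, is to pin each OpenMP thread with sched_setaffinity():

    #define _GNU_SOURCE
    #include <sched.h>    /* cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity */
    #include <omp.h>

    /* Pin each OpenMP thread to one core. With 8 threads and a core
       numbering like the figure's, T0-T3 end up on Processor 0 and
       T4-T7 on Processor 1 (the exact numbering is machine-specific). */
    static void pin_threads_to_cores(void) {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
        }
    }

The same effect is usually achievable without code changes through OMP_PROC_BIND/OMP_PLACES or GCC's GOMP_CPU_AFFINITY environment variables.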
Matrix processing

Matrix m[NX][NY], processed x-wise parallel. The rows processed by T0-T3 are allocated at Processor 0; the rows processed by T4-T7 are allocated at Processor 1.

#pragma omp parallel for
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]
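Such a placement typically arises from first-touch allocation: if the matrix is initialized by a parallel loop with the same schedule as the compute loop, each row block is first touched by the thread that later processes it and is therefore placed on that thread's processor. The init_m() routine below is an illustrative sketch, not code from the benchmark:

    #include <omp.h>

    #define NX 1024
    #define NY 1024

    static double m[NX][NY];

    /* Illustrative first-touch initialization: with the same static
       schedule as the compute loop, each row block is first touched
       (and hence placed) on the processor of the thread that will
       later process it. */
    static void init_m(void) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < NX; i++)
            for (int j = 0; j < NY; j++)
                m[i][j] = 0.0;
    }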
Matrix processing

Matrix m[NX][NY], now processed y-wise parallel (the parallel loop is over the columns), while the rows remain allocated as before: upper half at Processor 0, lower half at Processor 1.

for (i=0; i<NX; i++)
    #pragma omp parallel for
    for (j=0; j<NY; j++)
        // access m[i][j]
Example: NAS BT

Matrix m[NX][NY] is accessed in a time-step iteration, first x-wise parallel (rows divided among T0-T7), then y-wise parallel (columns divided among T0-T7):

for (t=0; t<TMAX; t++)
{
    x_wise();
    y_wise();
}
Example: NAS BT

[Figure: within one time-step iteration, x_wise() divides the rows of m among T0-T7 while y_wise() divides the columns among T0-T7; no allocation of the row halves to Processor 0 or Processor 1 suits both access patterns, so appropriate allocation is not possible.]

Result:
  Inter-processor shared heap: 35%
  Remote accesses: 19%
Solution?

1. Adjust data placement
   High overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling
   Limited by data dependences
3. Adjust data placement and iteration scheduling together
API

Library for data placement
  Set of common data distributions
Affinity-aware loop iteration scheduling
  Extension to the GCC OpenMP implementation
Example use case: NAS BT
Use-case: NAS BT

Remember: BT has two incompatible access patterns (repeated x-wise and y-wise access to the same data).
Idea: choose a data placement that accommodates both access patterns.

[Figure: blocked-exclusive data placement of m[NX][NY]: the upper-left and lower-right quadrants are allocated at Processor 0, the upper-right and lower-left quadrants at Processor 1.]
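The deck only shows the calls to the placement library, not what distribute_to() does internally. Below is a rough sketch of how the blocked-exclusive quadrant placement could be approximated on Linux with libnuma; the node numbers, matrix dimensions, and function name are assumptions, and real placement is page-granular, so half-rows smaller than a page cannot be split exactly.

    #include <numa.h>     /* numa_available(), numa_tonode_memory(); link with -lnuma */

    #define NX 1024
    #define NY 1024

    static double m[NX][NY];

    /* Approximate the blocked-exclusive placement from the figure:
       upper-left and lower-right quadrants on node 0,
       upper-right and lower-left quadrants on node 1. */
    static void distribute_block_exclusive(void) {
        if (numa_available() < 0)
            return;                              /* no NUMA support */
        size_t half_row = (NY / 2) * sizeof(double);
        for (int i = 0; i < NX; i++) {
            int upper = (i < NX / 2);
            numa_tonode_memory(&m[i][0],      half_row, upper ? 0 : 1);
            numa_tonode_memory(&m[i][NY / 2], half_row, upper ? 1 : 0);
        }
    }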
Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++)
{
    x_wise();
    y_wise();
}
Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++)
{
    // x_wise():
    #pragma omp parallel for
    for (i=0; i<NX; i++)
        for (j=0; j<NY; j++)
            // access m[i][j]
    y_wise();
}
x_wise()

Matrix processed in two steps
  Step 1: left half, all accesses local
  Step 2: right half, all accesses local

[Figure: m[NX][NY] under the blocked-exclusive placement, split into left and right NY/2-wide halves; in each step every thread works on rows whose half is allocated at its own processor.]
Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++)
{
    // x_wise(), split into two steps:
    #pragma omp parallel for
    for (i=0; i<NX; i++)
        for (j=0; j<NY/2; j++)
            // access m[i][j]
    #pragma omp parallel for
    for (i=0; i<NX; i++)
        for (j=NY/2; j<NY; j++)
            // access m[i][j]
    y_wise();
}
Use-case: NAS BT

distr_t *distr;
distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
distribute_to(distr);

for (t=0; t<TMAX; t++)
{
    // x_wise(), split into two steps with matching schedules:
    #pragma omp parallel for schedule(static)
    for (i=0; i<NX; i++)
        for (j=0; j<NY/2; j++)
            // access m[i][j]
    #pragma omp parallel for schedule(static-inverse)
    for (i=0; i<NX; i++)
        for (j=NY/2; j<NY; j++)
            // access m[i][j]
    y_wise();
}
Matrix processing

Matrix m[NX][NY], processed x-wise parallel with schedule(static):

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]
Matrix processing

With schedule(static), each thread is assigned one contiguous block of rows:

T0: m[0 .. NX/8 - 1][*]
T1: m[NX/8 .. 2*NX/8 - 1][*]
T2: m[2*NX/8 .. 3*NX/8 - 1][*]
T3: m[3*NX/8 .. 4*NX/8 - 1][*]
T4: m[4*NX/8 .. 5*NX/8 - 1][*]
T5: m[5*NX/8 .. 6*NX/8 - 1][*]
T6: m[6*NX/8 .. 7*NX/8 - 1][*]
T7: m[7*NX/8 .. NX - 1][*]

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]
static vs. static-inverse

#pragma omp parallel for schedule(static)
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]

T0: m[0 .. NX/8 - 1][*]
T1: m[NX/8 .. 2*NX/8 - 1][*]
T2: m[2*NX/8 .. 3*NX/8 - 1][*]
T3: m[3*NX/8 .. 4*NX/8 - 1][*]
T4: m[4*NX/8 .. 5*NX/8 - 1][*]
T5: m[5*NX/8 .. 6*NX/8 - 1][*]
T6: m[6*NX/8 .. 7*NX/8 - 1][*]
T7: m[7*NX/8 .. NX - 1][*]

#pragma omp parallel for schedule(static-inverse)
for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
        // access m[i][j]

With schedule(static-inverse), the same row blocks are assigned to the threads in inverse order:

T0: m[7*NX/8 .. NX - 1][*]
T1: m[6*NX/8 .. 7*NX/8 - 1][*]
T2: m[5*NX/8 .. 6*NX/8 - 1][*]
T3: m[4*NX/8 .. 5*NX/8 - 1][*]
T4: m[3*NX/8 .. 4*NX/8 - 1][*]
T5: m[2*NX/8 .. 3*NX/8 - 1][*]
T6: m[NX/8 .. 2*NX/8 - 1][*]
T7: m[0 .. NX/8 - 1][*]
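schedule(static-inverse) is the authors' extension to GCC's OpenMP runtime rather than standard OpenMP; below is a minimal sketch of how the same thread-to-chunk assignment could be emulated portably by computing chunk bounds by hand (assuming NX is divisible by the thread count; all names are illustrative):

    #include <omp.h>

    #define NX 1024
    #define NY 1024

    static double m[NX][NY];

    /* Emulate schedule(static-inverse): thread t processes the row block
       that thread (nthreads-1-t) would get under schedule(static). */
    static void x_wise_step2(void) {
        #pragma omp parallel
        {
            int t        = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            int chunk    = NX / nthreads;       /* assumes NX % nthreads == 0 */
            int lo       = (nthreads - 1 - t) * chunk;
            int hi       = lo + chunk;
            for (int i = lo; i < hi; i++)
                for (int j = NY / 2; j < NY; j++)
                    m[i][j] += 1.0;             /* stand-in for the real work */
        }
    }

Under the blocked-exclusive placement, each thread's inverse chunk lies in the half of the matrix that is local to its processor, which is exactly what the static-inverse schedule provides in the authors' runtime.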
y_wise()

Matrix processed in two steps
  Step 1: upper half, all accesses local
  Step 2: lower half, all accesses local

[Figure: threads T0-T7 each process a block of columns; under the blocked-exclusive placement, the upper NX/2 rows and the lower NX/2 rows are processed in separate steps so that every access is local.]
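The slides expand only x_wise(); by analogy, y_wise() would split its column-parallel loop into the same two steps, using static for the upper row half and static-inverse for the lower one. The sketch below is this reconstruction, not code shown in the deck (schedule(static-inverse) requires the authors' modified GCC):

    #include <omp.h>

    #define NX 1024
    #define NY 1024

    static double m[NX][NY];

    /* Hypothetical y_wise(), by analogy with the deck's x_wise():
       threads divide the columns; the upper and lower row halves are
       processed in separate steps so that the blocked-exclusive
       placement keeps every access local. */
    static void y_wise(void) {
        /* Step 1: upper half of the rows */
        for (int i = 0; i < NX / 2; i++) {
            #pragma omp parallel for schedule(static)
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;                 /* stand-in for the real work */
        }
        /* Step 2: lower half of the rows */
        for (int i = NX / 2; i < NX; i++) {
            #pragma omp parallel for schedule(static-inverse)  /* authors' extension */
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;
        }
    }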
Outline
Profile-based page placement
Memory access patterns
Matching data distribution and iteration scheduling
Evaluation
Conclusions
Evaluation

[Figure: bar chart of performance improvement over first-touch placement (0-25%) for cg.B, lu.C, bt.B, ft.B, and sp.B, comparing profile-based allocation with the program transformations.]
Scalability

Machine: 4-processor 32-core Intel Xeon E7-4830

[Figure: bar chart of performance improvement over first-touch placement (0-250%) for cg.C, lu.C, bt.C, ft.C, and sp.C.]
Conclusions

Automatic data placement (still) limited
  Alternating memory access patterns
  Inter-processor data sharing
Match memory access patterns and data placement
  Simple API: practical solution that works today
  Ample opportunities for further improvement
Thank you for your attention!