
Matching Memory Access Patterns and Data Placement for NUMA Systems
Zoltán Majó
Thomas R. Gross
Computer Science Department
ETH Zurich, Switzerland
Non-uniform memory architecture
[Diagram: two processors, each with four cores (Processor 0: Cores 0-3, Processor 1: Cores 4-7) and a memory controller (MC) attached to local DRAM; the processors are connected by an interconnect (IC)]
Non-uniform memory architecture

[Diagram: thread T0 runs on a core of Processor 0 and accesses data in DRAM, either through the local memory controller or across the interconnect to the other processor's DRAM]

Local memory accesses
  bandwidth: 10.1 GB/s
  latency: 190 cycles

Remote memory accesses
  bandwidth: 6.3 GB/s
  latency: 310 cycles

Key to good performance: data locality

All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ’09])
Data locality in multithreaded programs

[Bar chart: remote memory references as a fraction of total memory references, 0-60%, for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C]
Outline
 Automatic page placement
 Memory access patterns of matrix-based computations
 Matching memory access patterns and data placement
 Evaluation
 Conclusions
Automatic page placement
 Current OS support for NUMA: first-touch page placement (see the sketch below)
 Often high number of remote accesses
 Data address profiling
 Profile-based page placement
 Supported in hardware on many architectures
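First-touch placement interacts directly with how data is initialized: the thread that first writes a page determines where that page lands. A minimal sketch (not from the talk; the array name and sizes are illustrative, compile with -fopenmp) of parallel first-touch initialization, which places each thread's rows in its local DRAM, in contrast to a serial initialization loop that would put the whole array on one node:

  #define NX 4096              /* illustrative sizes */
  #define NY 4096

  static double m[NX][NY];     /* illustrative array */

  int main(void)
  {
      /* Parallel first-touch initialization, using the same static schedule
         as the compute loop: each thread writes its own rows first, so the
         corresponding pages are allocated in the DRAM local to its core.
         (A serial initialization loop here would first-touch every page
         from the master thread and place the whole array on one node.) */
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < NX; i++)
          for (int j = 0; j < NY; j++)
              m[i][j] = 0.0;

      /* Compute loop: with the identical schedule, each thread works mostly
         on pages it touched first, i.e., on local memory. */
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < NX; i++)
          for (int j = 0; j < NY; j++)
              m[i][j] = 2.0 * m[i][j] + 1.0;

      return 0;
  }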
Profile-based page placement
Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]
[Diagram: thread T0 runs on Processor 0 and thread T1 on Processor 1; the profile shows page P0 accessed 1000 times by T0 and page P1 accessed 3000 times by T1, so P0 is placed in Processor 0's DRAM and P1 in Processor 1's DRAM]
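A minimal sketch (an assumption, not the authors' implementation) of how such a profile could be turned into placement on Linux: the move_pages(2) system call migrates a set of pages to chosen NUMA nodes. The profile structure, page addresses, and node numbering are hypothetical.

  #include <numaif.h>      /* move_pages(); link with -lnuma */
  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical profile entry: a page and the node of its dominant accessor. */
  struct profile_entry {
      void *page;          /* page-aligned address, e.g. P0 or P1 */
      int   hottest_node;  /* NUMA node whose thread accesses the page most */
  };

  /* Migrate every profiled page to the node of its dominant accessor. */
  static void place_by_profile(struct profile_entry *prof, unsigned long n)
  {
      void **pages  = malloc(n * sizeof *pages);
      int   *nodes  = malloc(n * sizeof *nodes);
      int   *status = malloc(n * sizeof *status);

      for (unsigned long i = 0; i < n; i++) {
          pages[i] = prof[i].page;
          nodes[i] = prof[i].hottest_node;
      }

      /* pid 0 = this process; MPOL_MF_MOVE moves pages mapped only by it. */
      if (move_pages(0, n, pages, nodes, status, MPOL_MF_MOVE) < 0)
          perror("move_pages");

      free(pages); free(nodes); free(status);
  }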
Automatic page placement
 Compare: first-touch and profile-based page placement
 Machine: 2-processor 8-core Intel Xeon E5520
 Subset of NAS PB: programs with high fraction of remote accesses
 8 threads with fixed thread-to-core mapping
Profile-based page placement

[Bar chart: performance improvement over first-touch, 0-25%, for cg.B, lu.C, bt.B, ft.B, sp.B]
Inter-processor data sharing

[Diagram: thread T0 runs on Processor 0 and thread T1 on Processor 1; the profile shows page P0 accessed 1000 times by T0, page P1 accessed 3000 times by T1, and page P2 accessed 4000 times by T0 and 5000 times by T1; P0 and P1 are placed locally to their accessors, while P2 is inter-processor shared and ends up in one processor's DRAM]
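A short calculation on the profile above (not shown on the slide) makes the problem concrete: placing P2 at Processor 1 turns T0's 4000 accesses into remote accesses, i.e. 4000 / (4000 + 5000) ≈ 44% of all references to P2; placing it at Processor 0 instead makes T1's 5000 accesses remote, ≈ 56%. No single placement of a shared page avoids a large remote fraction.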
Inter-processor data sharing

Inter-processor shared heap relative to total heap

[Bar chart: shared heap / total heap, 0-60%, for cg.B, lu.C, bt.B, ft.B, sp.B]
Inter-processor data sharing

[Bar chart: inter-processor shared heap relative to total heap (left axis, 0-60%) and performance improvement over first-touch (right axis, 0-30%) for cg.B, lu.C, bt.B, ft.B, sp.B]
Automatic page placement
 Profile-based page placement often ineffective
 Reason: inter-processor data sharing
 Inter-processor data sharing is a program property
 Detailed look: program memory access patterns
 Loop-parallel programs with OpenMP-like parallelization
 Matrix processing
 NAS BT
Matrix processing
m[NX][NY]

Process m sequentially:

  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]
Matrix processing

m[NX][NY]

Process m x-wise parallel: the rows of m are divided into eight blocks, one per thread T0-T7

  #pragma omp parallel for
  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]
Thread scheduling

Remember: fixed thread-to-core mapping

[Diagram: threads T0-T3 run on the cores of Processor 0 and threads T4-T7 on the cores of Processor 1; each processor has its own local DRAM]
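A minimal sketch (assumed, not the talk's code) of one way to obtain such a fixed thread-to-core mapping on Linux: each OpenMP thread pins itself to a core derived from its thread id. The identity mapping of thread i to core i is an assumption; actual core numbering is machine specific.

  #define _GNU_SOURCE          /* for cpu_set_t, CPU_SET, sched_setaffinity */
  #include <sched.h>
  #include <omp.h>

  /* Pin each OpenMP thread to one core so its placement never changes. */
  static void pin_threads(void)
  {
      #pragma omp parallel
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(omp_get_thread_num(), &set);      /* thread i -> core i (assumption) */
          sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
      }
  }

In practice the same effect can often be obtained without code through the OMP_PROC_BIND and OMP_PLACES environment variables.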
Matrix processing

m[NX][NY]

Process m x-wise parallel: the upper half of m (the rows processed by T0-T3) is allocated at Processor 0, the lower half (the rows processed by T4-T7) at Processor 1

  #pragma omp parallel for
  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]
Matrix processing

m[NX][NY]

Process m y-wise parallel: the columns of m are divided into eight blocks, one per thread T0-T7, while the upper half of the rows remains allocated at Processor 0 and the lower half at Processor 1

  for (i=0; i<NX; i++)
    #pragma omp parallel for
    for (j=0; j<NY; j++)
      // access m[i][j]
Example: NAS BT

m[NX][NY]

Time-step iteration:

  for (t=0; t<TMAX; t++)
  {
    x_wise();
    y_wise();
  }

[Diagram: in x_wise() each thread T0-T7 processes a block of rows, in y_wise() a block of columns; with the upper half of m allocated at Processor 0 and the lower half at Processor 1, an allocation appropriate for both access patterns is not possible]

Result:
  Inter-processor shared heap: 35%
  Remote accesses: 19%
Solution?
1. Adjust data placement
   High overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling
   Limited by data dependences
3. Adjust data placement and iteration scheduling together
API
 Library for data placement
 Set of common data distributions
 Affinity-aware loop iteration scheduling
 Extension to GCC OpenMP implementation
 Example use case: NAS BT
Use-case: NAS BT

 Remember: BT has two incompatible access patterns
 Repeated x-wise and y-wise access to the same data
 Idea: data placement to accommodate both access patterns

Blocked-exclusive data placement:

[Diagram: m is split into four NX/2 x NY/2 quadrants; the upper-left and lower-right quadrants are allocated at Processor 0, the upper-right and lower-left quadrants at Processor 1]
Use-case: NAS BT

  distr_t *distr;
  distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
  distribute_to(distr);

  for (t=0; t<TMAX; t++)
  {
    x_wise();
    y_wise();
  }
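Read as an interpretation (the exact semantics belong to the authors' library): block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2) presumably describes a blocked-exclusive distribution of the array starting at m, covering all sizeof(m) bytes, with a distribution block size of half a row (sizeof(m[0])/2 bytes), matching the four-quadrant placement on the previous slide; distribute_to(distr) then applies the distribution.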
Use-case: NAS BT

  distr_t *distr;
  distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
  distribute_to(distr);

  for (t=0; t<TMAX; t++)
  {
    /* x_wise() expanded into its loops: */
    #pragma omp parallel for
    for (i=0; i<NX; i++)
      for (j=0; j<NY; j++)
        // access m[i][j]

    y_wise();
  }
x_wise()

Matrix processed in two steps:
  Step 1: left half (columns 0 .. NY/2-1), all accesses local
  Step 2: right half (columns NY/2 .. NY-1), all accesses local

[Diagram: threads T0-T7 each process a block of rows; the upper-left and lower-right quadrants of m are allocated at Processor 0, the upper-right and lower-left quadrants at Processor 1]
Use-case: NAS BT

  distr_t *distr;
  distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
  distribute_to(distr);

  for (t=0; t<TMAX; t++)
  {
    /* x_wise(), split into its left and right halves: */
    #pragma omp parallel for
    for (i=0; i<NX; i++)
      for (j=0; j<NY/2; j++)
        // access m[i][j]

    #pragma omp parallel for
    for (i=0; i<NX; i++)
      for (j=NY/2; j<NY; j++)
        // access m[i][j]

    y_wise();
  }
Use-case: NAS BT

  distr_t *distr;
  distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0])/2);
  distribute_to(distr);

  for (t=0; t<TMAX; t++)
  {
    #pragma omp parallel for schedule(static)
    for (i=0; i<NX; i++)
      for (j=0; j<NY/2; j++)
        // access m[i][j]

    #pragma omp parallel for schedule(static-inverse)
    for (i=0; i<NX; i++)
      for (j=NY/2; j<NY; j++)
        // access m[i][j]

    y_wise();
  }
Matrix processing

m[NX][NY]

Process m x-wise parallel:

  #pragma omp parallel for schedule(static)
  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]

With schedule(static), the blocks of rows are assigned to the threads as follows:

  T0: m[0 .. NX/8 - 1][*]
  T1: m[NX/8 .. 2*NX/8 - 1][*]
  T2: m[2*NX/8 .. 3*NX/8 - 1][*]
  T3: m[3*NX/8 .. 4*NX/8 - 1][*]
  T4: m[4*NX/8 .. 5*NX/8 - 1][*]
  T5: m[5*NX/8 .. 6*NX/8 - 1][*]
  T6: m[6*NX/8 .. 7*NX/8 - 1][*]
  T7: m[7*NX/8 .. NX - 1][*]
static vs. static-inverse

  #pragma omp parallel for schedule(static)
  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]

  T0: m[0 .. NX/8 - 1][*]
  T1: m[NX/8 .. 2*NX/8 - 1][*]
  T2: m[2*NX/8 .. 3*NX/8 - 1][*]
  T3: m[3*NX/8 .. 4*NX/8 - 1][*]
  T4: m[4*NX/8 .. 5*NX/8 - 1][*]
  T5: m[5*NX/8 .. 6*NX/8 - 1][*]
  T6: m[6*NX/8 .. 7*NX/8 - 1][*]
  T7: m[7*NX/8 .. NX - 1][*]

  #pragma omp parallel for schedule(static-inverse)
  for (i=0; i<NX; i++)
    for (j=0; j<NY; j++)
      // access m[i][j]

  T0: m[7*NX/8 .. NX - 1][*]
  T1: m[6*NX/8 .. 7*NX/8 - 1][*]
  T2: m[5*NX/8 .. 6*NX/8 - 1][*]
  T3: m[4*NX/8 .. 5*NX/8 - 1][*]
  T4: m[3*NX/8 .. 4*NX/8 - 1][*]
  T5: m[2*NX/8 .. 3*NX/8 - 1][*]
  T6: m[NX/8 .. 2*NX/8 - 1][*]
  T7: m[0 .. NX/8 - 1][*]
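schedule(static-inverse) is the authors' extension to GCC's OpenMP implementation, so it is not available in a stock compiler. A minimal sketch (an assumption, not the authors' code; function and parameter names are made up) of how the same inverted block-to-thread assignment could be written with standard OpenMP, shown for the right-half loop of x_wise():

  #include <omp.h>

  /* Right half of x_wise() with the block-to-thread assignment inverted by
     hand: thread t takes the block of rows that schedule(static) would give
     to thread (nthreads - 1 - t). */
  void x_wise_right_half(int NX, int NY, double *m)  /* m: NX x NY, row-major */
  {
      #pragma omp parallel
      {
          int t   = omp_get_thread_num();
          int nt  = omp_get_num_threads();
          int inv = nt - 1 - t;                    /* inverted block index */
          int lo  = (int)((long)inv       * NX / nt);
          int hi  = (int)((long)(inv + 1) * NX / nt);

          for (int i = lo; i < hi; i++)
              for (int j = NY / 2; j < NY; j++)
                  m[(long)i * NY + j] += 1.0;      /* stand-in for the real access */
      }
  }

The point of the inversion is that in the right half of the matrix the quadrants sit on the opposite processors, so the threads of each processor must take over the other processor's row blocks to keep all accesses local.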
y_wise()

Matrix processed in two steps:
  Step 1: upper half (rows 0 .. NX/2-1), all accesses local
  Step 2: lower half (rows NX/2 .. NX-1), all accesses local

[Diagram: threads T0-T7 each process a block of columns; the upper-left and lower-right quadrants of m are allocated at Processor 0, the upper-right and lower-left quadrants at Processor 1]
Outline
 Profile-based page placement
 Memory access patterns
 Matching data distribution and iteration scheduling
 Evaluation
 Conclusions
Evaluation

[Bar chart: performance improvement over first-touch, 0-25%, for cg.B, lu.C, bt.B, ft.B, sp.B, comparing profile-based allocation and the program transformations]
Scalability

Machine: 4-processor 32-core Intel Xeon E7-4830

[Bar chart: performance improvement over first-touch, 0-250%, for cg.C, lu.C, bt.C, ft.C, sp.C]
Conclusions
 Automatic data placement (still) limited
 Alternating memory access patterns
 Inter-processor data sharing
 Match memory access patterns and data placement
 Simple API: practical solution that works today
 Ample opportunities for further improvement
Thank you for your attention!