Extensions to Structure Layout Optimizations in the Open64 Compiler

Download Report

Transcript Extensions to Structure Layout Optimizations in the Open64 Compiler

Extensions to Structure Layout
Optimizations in the Open64 Compiler
Michael Lai
AMD
Related Work
• Structure splitting, structure peeling, structure
field reordering (Hagog & Tice, Hundt,
Mannarswamy & Chakrabarti)
• Above implemented in the Open64 Compiler
(Chakrabarti & Chow)
• Structure instance interleaving (Truong, Bodin &
Seznec)
• Data splitting (Curial, Zhao & Amaral)
• Array reshaping (Zhao, Cui, Gao, Silvera & Amaral)
Current Framework
source
source
frontend
source
frontend
frontend
WHIRL
WHIRL
WHIRL
ipl
ipl
ipl
.o
.o
.o
ipa_link
WHIRL
WHIRL
Instance Interleaving
a[0].field_1
a[0].field_2 an “instance” of the structure
a[0].field_3
a[1].field_1
a[1].field_2 another “instance” of the structure
a[1].field_3
...
Instance Interleaving
a[0].field_1
a[1].field_1
…
a[0].field_2
a[1].field_2
…
a[0].field_3
a[1].field_3
...
field_1 of all the instances are
interleaved together
field_2 of all the instances are
interleaved together
field_3 of all the instances are
interleaved together
Instance Interleaving
array[0].field_1
array[0].field_2
array[0].field_3
…
array[0].field_m
array[1].field_1
array[1].field_2
array[1].field_3
…
array[1].field_m
…
array[n-1].field_1
array[n-1].field_2
array[n-1].field_3
…
array[n-1].field_m
array[0].field_1
array[1].field_1
array[2].field_1
…
array[n-1].field_1
array[0].field_2
array[1].field_2
array[2].field_2
…
array[n-1].field_2
…
array[0].field_m
array[1].field_m
array[2].field_m
…
array[n-1].field_m
Implementation
• Profitability analysis (done in ipl)
– During ipl compilation of each source file, access
patterns of structure fields are analyzed and their
usage statistics recorded
– After all the functions have been compiled by ipl,
the “most likely to benefit” structure (if any) is
marked and passed to ipo
– (By way of illustration, the ideal structure is one
with many fields, each of which appearing in its
own hot loop)
Implementation
• Legality analysis (done in ipo)
– Usual checking for address taken, escaped types,
etc.
• Code transformation (done in ipo)
– Create internal pointers ptr_1, ptr_2, …, ptr_m to
keep track of the m locations array[0].field_1,
array[0].field_2, …, array[0].field_m
– Rewrite array[i].field_j to ptr_j[i], if “i” is known;
otherwise, incur additional overhead to compute
“i”
Instance Interleaving
array[0].field_1
array[0].field_2
array[0].field_3
…
array[0].field_m
array[1].field_1
array[1].field_2
array[1].field_3
…
array[1].field_m
…
array[n-1].field_1
array[n-1].field_2
array[n-1].field_3
…
array[n-1].field_m
array[0].field_1
array[1].field_1
array[2].field_1
…
array[n-1].field_1
array[0].field_2
array[1].field_2
array[2].field_2
…
array[n-1].field_2
…
array[0].field_m
array[1].field_m
array[2].field_m
…
array[n-1].field_m
array[i].field_j becomes ptr_j[i]
= ptr_1
= ptr_2
= ptr_m
Array Remapping
field_1
a[0]
field_2
a[1]
iteration 0
field_3
a[2]
…
…
field_m
a[m-1]
field_1
a[m]
field_2
a[m+1]
iteration 1
field_3
a[m+2]
…
…
field_m
a[2m-1]
…
…
field_1
a[(n-1)m]
field_2 a[(n-1)m+1]
iteration n-1 field_3 a[(n-1)m+2]
…
…
field_m
a[nm-1]
a[0]
a[1]
a[2]
…
a[n-1]
a[n]
a[n+1]
a[n+2]
…
a[2n-1]
…
a[(m-1)n]
a[(m-1)n+1]
a[(m-1)n+2]
…
a[mn-1]
field_1
field_1
field_1
…
field_1
field_2
field_2
field_2
…
field_2
…
field_m
field_m
field_m
…
field_m
iteration 0
iteration 1
iteration 2
iteration n-1
iteration 0
iteration 1
iteration 2
iteration n-1
iteration 0
iteration 1
iteration 2
iteration n-1
Implementation
• Profitability analysis (done in ipl)
– During ipl compilation of each source file, discover
if there are arrays that behave like structures and
suffer poor data cache utilization at the same time
– After all the functions have been compiled by ipl,
the “most likely to benefit” arrays (if any) are
marked and passed to ipo
– For each of these arrays, record the stride, group
size, and array size associated with it
Implementation
• Legality analysis (done in ipo)
– Check for array aliasing, address taken, argument
passing, etc.
• Code transformation (done in ipo)
– Construct the array remapping permutation
alpha(i) = (i % m) * n + (i / m), where m is the
group size and n is the number of such groups
– Rewrite a[i] to a[alpha(i)]
Array Remapping
field_1
a[0]
field_2
a[1]
iteration 0
field_3
a[2]
…
…
field_m
a[m-1]
field_1
a[m]
field_2
a[m+1]
iteration 1
field_3
a[m+2]
…
…
field_m
a[2m-1]
…
…
field_1
a[(n-1)m]
field_2 a[(n-1)m+1]
iteration n-1 field_3 a[(n-1)m+2]
…
…
field_m
a[nm-1]
a[0]
a[1]
a[2]
…
a[n-1]
a[n]
a[n+1]
a[n+2]
…
a[2n-1]
…
a[(m-1)n]
a[(m-1)n+1]
a[(m-1)n+2]
…
a[mn-1]
a[i] becomes a[(i%m)*n+(i/m)]
field_1
field_1
field_1
…
field_1
field_2
field_2
field_2
…
field_2
…
field_m
field_m
field_m
…
field_m
iteration 0
iteration 1
iteration 2
iteration n-1
iteration 0
iteration 1
iteration 2
iteration n-1
iteration 0
iteration 1
iteration 2
iteration n-1
Performance Results
AMD system
speed (1-copy) run
rate (12-copy) run
462.libquantum (structure
peeling)
+6.35%
+43.43%
429.mcf (instance
interleaving)
+2.43%
+38.38%
470.lbm (array remapping)
-16.35% (degradation)
+138.55%
Intel system
speed (1-copy) run
rate (4-copy) run
462.libquantum (structure
peeling)
+7.01%
+24.30%
429.mcf (instance
interleaving)
-6.04% (degradation)
+34.62%
470.lbm (array remapping)
-23.28% (degradation)
+119.51%
Future Work
• Integrate existing structure layout
optimizations with the new structure instance
interleaving work
• Combine profitability heuristics of all structure
layout optimizations
• Extend structure instance interleaving
optimization to more than one structure
• Extend array remapping optimization to multidimensional arrays
References
1.
2.
3.
4.
G. Chakrabarti and F. Chow. “Structure Layout Optimizations in
the Open64 Compiler.” Proceedings of the Open64 Workshop,
Boston, 2008.
M. Hagog and C. Tice. “Cache Aware Data Layout Reorganization
Optimization in gcc.” Proceedings of the gcc Developers Summit,
2005.
R. Hundt, S. Mannarswamy, and D.R. Chakrabarti. “Practical
Structure Layout Optimization and Advice.” Proceedings of the
International Symposium on Code Generation and Optimization,
New York, 2006.
D.N. Truong, F. Bodin, and A. Seznec. “Improving Cache Behavior
of Dynamically Allocated Data Structures.” Proceedings of the
International Conference on Parallel Architectures and
Compilation Techniques, Washington D.C., 1998.