
USC Information Sciences Institute
An Open64-based Compiler Approach to
Performance Prediction and Performance
Sensitivity Analysis for Scientific Codes
Jeremy Abramson and Pedro C. Diniz
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, California 90292
Motivation
• Performance analysis is conceptually easy
  – Just run the program!
    • That gives the "what" of performance. Is this interesting?
  – Is that realistic?
    • Huge programs with large data sets
    • "Uncertainty principle" and intractability of profiling/instrumenting
• Performance prediction and analysis is in practice very hard
  – Not just interested in wall clock time
    • The "why" of performance is a big concern
    • How to accurately characterize program behavior?
    • What about architecture effects?
  – Can't reuse wall clock time
  – Can reuse program characteristics
Motivation (2)
• What about the future?
  – Different architecture = better results?
  – Compiler transformations (e.g., loop unrolling)
• Need a fast, scalable, automated way of determining program characteristics
  – Determine what causes poor performance
• What does profiling tell us?
  – How can the programmer use profiling (low-level) information?
Overview
• Approach
– High level / low level synergy
– Not architecture-bound
• Experimental results
– CG core
• Caveats and future work
• Conclusion
Low versus High level information
# low-level view of: b = a[i] + 1
la   $r0, a
lw   $r1, i
mult $offset, $r1, 4
add  $offset, $offset, $r0
lw   $r2, 0($offset)
add  $r3, $r2, 1
la   $r4, b
sw   $r3, 0($r4)
• Which can provide meaningful performance information to
a programmer?
• How do we capture the information at a low level while
maintaining the structure of high level source?
Low versus High level information (2)
• Drawbacks of looking at low-level
– Too much data!
– You found a “problem” spot. What now?
• How do programmers relate information back to source level?
• Drawbacks of looking at source-level
– What about the compiler?
• Code may look very different
– Architecture impacts?
• Solution: Look at high-level structure, try to anticipate
compiler
Experimental Approach
• Goal: Derive performance expectations from source code for different
architectures
– What should the performance be and why?
– What is limiting the performance?
• Data-dependencies?
• Architecture limitations?
• Use high level information
– WHIRL intermediate representation in Open64
• Arrays not lowered
• Construct DFG
– Decorate graph with latency information
• Schedule the DFG
– Compute as-soon-as-possible schedule
– Variable number of functional units
• ALU, Load/Store, Registers
• Pipelining of operations
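The scheduling step above can be sketched in a few lines: an as-soon-as-possible (ASAP) list scheduler over a latency-annotated DFG with a bounded number of functional units per type. This is a minimal sketch, not SLOPE's implementation; the node names, latencies, and unit counts are illustrative assumptions, and units are assumed fully pipelined (an operation occupies its unit only in its issue cycle), matching the "pipelining of operations" bullet.

```python
from collections import defaultdict

def asap_schedule(nodes, edges, latency, units):
    """nodes: {name: unit_type}; edges: (producer, consumer) pairs;
    latency: {unit_type: cycles}; units: {unit_type: count}.
    Returns {node: issue_cycle} for an ASAP schedule."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    start = {}
    in_use = defaultdict(lambda: defaultdict(int))  # cycle -> type -> busy units
    remaining = set(nodes)
    while remaining:
        for n in sorted(remaining):          # deterministic order
            if any(p in remaining for p in preds[n]):
                continue                     # a predecessor is unscheduled
            utype = nodes[n]
            # Earliest cycle at which every input value is ready.
            t = max((start[p] + latency[nodes[p]] for p in preds[n]), default=0)
            # Fully pipelined units: only the issue cycle occupies a unit.
            while in_use[t][utype] >= units[utype]:
                t += 1
            start[n] = t
            in_use[t][utype] += 1
            remaining.remove(n)
            break
    return start

# Hypothetical DFG for "B = A[i] + 1" (illustrative node names and latencies):
nodes = {"LDID_i": "lsu", "LDA_A": "alu", "mul4": "alu",
         "addr": "alu", "load": "lsu", "add1": "alu", "store": "lsu"}
edges = [("LDID_i", "mul4"), ("mul4", "addr"), ("LDA_A", "addr"),
         ("addr", "load"), ("load", "add1"), ("add1", "store")]
sched = asap_schedule(nodes, edges,
                      latency={"alu": 1, "lsu": 2},
                      units={"alu": 1, "lsu": 1})
```

Varying the `units` dictionary is what lets a schedule like this answer the sensitivity questions in the results slides (e.g., cycles per iteration as a function of the number of load/store units).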
Compilation process
1. Source (C/Fortran):
     for (i; i < 0; …
       …
       B = A[i] + 1
       …
2. Open64 WHIRL (High-level):
     OPR_STID: B
       OPR_ADD
         OPR_ARRAY
           OPR_LDA: A
           OPR_LDID: i
         OPR_CONST: 1
3. Annotated DFG
Memory modeling approach
• i is a loop induction variable
• The Array node represents the address calculation at a high level
• Array expression is affine: assume a cache hit, and assign latency accordingly
• Register hit? Assign latency 0
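The affine-subscript rule above can be sketched as follows. This is a minimal illustration under assumed latencies; the tuple representation of subscripts and the cycle counts are inventions for the example, not SLOPE's actual model.

```python
# Assumed latencies for the sketch (not SLOPE's real numbers).
CACHE_HIT_CYCLES = 2     # affine (predictable) access: assume a hit
CACHE_MISS_CYCLES = 40   # otherwise we cannot assume a hit

def is_affine(expr, ivar):
    """True if expr is affine in the induction variable ivar.
    expr is an int, a variable name, or an (op, lhs, rhs) tuple;
    anything else (e.g. an indirect load) is treated as non-affine."""
    if isinstance(expr, int) or expr == ivar:
        return True
    if isinstance(expr, str):        # some other loop-invariant scalar
        return True
    if isinstance(expr, tuple) and len(expr) == 3:
        op, lhs, rhs = expr
        if op in ("+", "-"):
            return is_affine(lhs, ivar) and is_affine(rhs, ivar)
        if op == "*":                # affine only if one factor is constant
            return (isinstance(lhs, int) and is_affine(rhs, ivar)) or \
                   (isinstance(rhs, int) and is_affine(lhs, ivar))
    return False                     # indirect or nonlinear subscript

def access_latency(subscript, ivar):
    """Assign a load latency based on whether the subscript is affine."""
    return CACHE_HIT_CYCLES if is_affine(subscript, ivar) else CACHE_MISS_CYCLES
```

In the CG example that follows, a(k) has an affine subscript and would get the hit latency, while y(rowidx(k)) is an indirect access and would not.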
Example: CG
do 200 j = 1, n
xj = x(j)
do 100 k = colstr(j) , colstr(j+1)-1
y(rowidx(k)) = y(rowidx(k)) + a(k) * xj
100 continue
200 continue
CG Analysis Results
[Figure 4. Validation results of CG on a MIPS R10000 machine. Bar chart of cycles per iteration ("All iterations" and "Outer Loop") for optimized code (-O3 compiler switch), non-optimized code, and the generated prediction.]

Prediction results are consistent with the un-optimized version of the code.
CG Analysis Results (2)
[Figure 5. Cycle time for an iteration of CG with varying architectural configurations. Cycles per iteration versus number of load/store units (1 LSU to 5 LSU), plotted for 1 floating point unit and for 5 floating point units.]
• What’s the best way to use processor space?
– Pipelined ALUs?
– Replicate standard ALUs?
Caveats, Future Work
• More compiler-like features are needed to improve accuracy
– Control flow
• Implement trace scheduling
• Multiple-paths can give upper/lower performance bounds
– Simple compiler transformations
• Common sub-expression elimination
• Strength reduction
• Constant folding
– Register allocation
• “Distance”-based methods?
• Anticipate cache for spill code
– Software pipelining?
• Unrolling exploits ILP
• Run-time data?
– Array references, loop trip counts, access patterns from performance
skeletons
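One of the listed transformations, strength reduction, can be sketched concretely. This is a hypothetical example, not SLOPE code: the per-iteration multiply in an array address calculation (offset = i * elem_size) is replaced by an add on a running offset, and the function names and element size are assumptions for the illustration.

```python
def addresses_naive(base, n, elem_size=4):
    # one multiply per iteration to form each element address
    return [base + i * elem_size for i in range(n)]

def addresses_strength_reduced(base, n, elem_size=4):
    # multiply replaced by an addition on a running offset
    out, offset = [], 0
    for _ in range(n):
        out.append(base + offset)
        offset += elem_size
    return out
```

A predictor that anticipates this transformation would charge an add rather than a multiply per iteration in the address-calculation part of the DFG.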
Conclusions
• SLOPE provides very fast performance prediction and
analysis results
• High-level approach gives more meaningful information
– Still try to anticipate compiler and memory hierarchy
• More compiler transformations to be added
– Maintain high-level approach, refine low-level accuracy