Transcript Document
USC INFORMATION SCIENCES INSTITUTE

An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes
Jeremy Abramson and Pedro C. Diniz
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292

Motivation
• Performance analysis is conceptually easy
  – Just run the program!
• The "what" of performance: is this interesting? Is that realistic?
  – Huge programs with large data sets
  – "Uncertainty principle" and the intractability of profiling/instrumenting
• Performance prediction and analysis is in practice very hard
  – We are not just interested in wall-clock time
    • The "why" of performance is a big concern
  – How do we accurately characterize program behavior?
  – What about architecture effects?
    • Can't reuse wall-clock time
    • Can reuse program characteristics

Motivation (2)
• What about the future?
  – Different architecture = better results?
  – Compiler transformations (e.g., loop unrolling)
• We need a fast, scalable, automated way of determining program characteristics
  – Determine what causes poor performance
• What does profiling tell us?
• How can the programmer use profiling (low-level) information?

Overview
• Approach
  – High-level / low-level synergy
  – Not architecture-bound
• Experimental results
  – CG core
• Caveats and future work
• Conclusion

Low versus high-level information

  la   $r0, a              # address of a
  lw   $r1, i              # load index i
  mult $offset, $r1, 4     # scale index by element size
  add  $offset, $offset, $r0
  lw   $r2, ($offset)      # load a[i]
  add  $r3, $r2, 1         # a[i] + 1
  la   $r4, b              # address of b
  sw   $r3, ($r4)          # store result into b

• Which representation can provide meaningful performance information to a programmer?
• How do we capture information at a low level while maintaining the structure of the high-level source?

Low versus high-level information (2)
• Drawbacks of looking at the low level
  – Too much data!
  – You found a "problem" spot.
What now?
• How do programmers relate information back to the source level?
• Drawbacks of looking at the source level
  – What about the compiler? Code may look very different
  – Architecture impacts?
• Solution: look at the high-level structure and try to anticipate the compiler

Experimental Approach
• Goal: derive performance expectations from source code for different architectures
  – What should the performance be, and why?
  – What is limiting the performance?
    • Data dependences?
    • Architecture limitations?
• Use high-level information
  – The WHIRL intermediate representation in Open64
  – Arrays are not lowered
• Construct a data-flow graph (DFG)
  – Decorate the graph with latency information
• Schedule the DFG
  – Compute an as-soon-as-possible (ASAP) schedule
  – Variable number of functional units (ALUs, load/store units, registers)
  – Pipelining of operations

Compilation process
1. Source (C/Fortran):
     for (i = 0; i < N; ...)
       B = A[i] + 1;
2. Open64 WHIRL (high level):
     OPR_STID: B
     └─ OPR_ADD
        ├─ OPR_ARRAY
        │  ├─ OPR_LDA: A
        │  └─ OPR_LDID: i
        └─ OPR_CONST: 1
3. Annotated DFG

Memory modeling approach
• i is a loop induction variable
• The array node represents the address calculation at a high level
• If the array expression is affine, assume a cache hit and assign latency accordingly
• Register hit? Assign latency 0

Example: CG

  do 200 j = 1, n
     xj = x(j)
     do 100 k = colstr(j), colstr(j+1)-1
        y(rowidx(k)) = y(rowidx(k)) + a(k) * xj
100  continue
200  continue

CG Analysis Results
[Bar chart: cycles per iteration ("All iterations" vs. "Outer loop") for optimized code (-O3 compiler switch), non-optimized code, and the generated prediction]
Figure 4.
Validation results of CG on a MIPS R10000 machine
• The prediction results are consistent with the un-optimized version of the code

CG Analysis Results (2)
[Line chart: cycles per CG iteration versus the number of load/store units (1–5 LSUs), shown for 1 and for 5 floating-point units]
Figure 5. Cycle time for an iteration of CG with varying architectural configurations
• What's the best way to use processor space?
  – Pipelined ALUs?
  – Replicated standard ALUs?

Caveats, Future Work
• More compiler-like features are needed to improve accuracy
  – Control flow
    • Implement trace scheduling
    • Multiple paths can give upper/lower performance bounds
  – Simple compiler transformations
    • Common subexpression elimination
    • Strength reduction
    • Constant folding
  – Register allocation
    • "Distance"-based methods?
    • Anticipate cache behavior for spill code
  – Software pipelining?
    • Unrolling exploits ILP
• Run-time data?
  – Array references, loop trip counts, and access patterns from performance skeletons

Conclusions
• SLOPE provides very fast performance prediction and analysis results
• The high-level approach gives more meaningful information
  – It still tries to anticipate the compiler and the memory hierarchy
• More compiler transformations are to be added
  – Maintain the high-level approach, refine low-level accuracy
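The DFG scheduling step outlined in the Experimental Approach slide (an ASAP schedule over a latency-annotated data-flow graph, with a configurable number of pipelined functional units) can be sketched roughly as follows. This is a minimal illustration, not SLOPE's actual implementation; the node names, latencies, and unit counts are all assumptions:

```python
# Rough sketch: list-scheduling a latency-annotated data-flow graph in
# as-soon-as-possible (ASAP) order with a limited number of pipelined
# functional units. Pipelined units only constrain the *issue* cycle:
# each unit can accept one new operation per cycle.
from collections import defaultdict

def asap_schedule(nodes, edges, units):
    """nodes: {name: (unit_kind, latency)}, listed in topological order.
       edges: (producer, consumer) data dependences.
       units: {unit_kind: number of units available}.
       Returns {name: issue cycle}."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    issue = {}                                    # name -> issue cycle
    busy = defaultdict(lambda: defaultdict(int))  # cycle -> kind -> issues
    for name, (kind, latency) in nodes.items():
        # Earliest cycle permitted by data dependences.
        t = max((issue[p] + nodes[p][1] for p in preds[name]), default=0)
        # Stall while every unit of this kind already issues at cycle t.
        while busy[t][kind] >= units[kind]:
            t += 1
        busy[t][kind] += 1
        issue[name] = t
    return issue

# Toy DFG for "B = A[i] + 1": load, add, store (assumed latencies).
dfg = {"ld_a": ("mem", 2), "add1": ("alu", 1), "st_b": ("mem", 1)}
deps = [("ld_a", "add1"), ("add1", "st_b")]
print(asap_schedule(dfg, deps, {"mem": 1, "alu": 1}))
# -> {'ld_a': 0, 'add1': 2, 'st_b': 3}
```

Varying the `units` table is the kind of sensitivity experiment Figure 5 reports: longer latencies delay dependent operations, while fewer units delay independent operations that contend for issue slots.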
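The memory-modeling heuristic above (an affine array subscript is assumed to hit in cache; a non-affine one, such as the indirect reference y(rowidx(k)) in CG, is not) could be sketched like this. The expression encoding, latency constants, and function names are all hypothetical:

```python
# Hypothetical sketch of the affine-subscript heuristic: subscripts are
# tiny expression trees built from ints, symbols (strings), and
# ("add"|"mul", left, right) nodes; anything else (e.g. an indirect
# index like rowidx(k)) is treated as non-affine. Latency constants are
# made up for illustration.
CACHE_HIT_CYCLES = 2     # assumed L1 hit latency
CACHE_MISS_CYCLES = 20   # assumed miss latency

def mentions(expr, ivar):
    """Does the expression reference the induction variable?"""
    if isinstance(expr, (int, str)):
        return expr == ivar
    return mentions(expr[1], ivar) or mentions(expr[2], ivar)

def is_affine(expr, ivar):
    """Affine in ivar: constants/symbols, sums of affine terms, and
       products where at most one factor involves ivar (linearly)."""
    if isinstance(expr, (int, str)):
        return True
    op, a, b = expr
    if op == "add":
        return is_affine(a, ivar) and is_affine(b, ivar)
    if op == "mul":
        return (is_affine(a, ivar) and is_affine(b, ivar)
                and not (mentions(a, ivar) and mentions(b, ivar)))
    return False  # indirect or otherwise non-affine subscript

def memory_latency(subscript, ivar):
    return CACHE_HIT_CYCLES if is_affine(subscript, ivar) else CACHE_MISS_CYCLES

# a(k): affine in k -> assume a hit; y(rowidx(k)): indirect -> assume a miss.
print(memory_latency("k", "k"), memory_latency(("idx", "rowidx", "k"), "k"))
# -> 2 20
```

This distinction is why the analysis can flag the scatter through rowidx(k) in the CG kernel, rather than the arithmetic, as the likely performance limiter.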