Transcript Profiling

Profiling, Performance Tuning, and
Design Issues
Basic Efficiency Guidelines
• Select best algorithm.
– How to know? Scalable? Portable?
• Use efficient libraries when possible
• Compiler optimizations.
• Code Optimization
Compiler Options for producing
the Fastest Executable
• Using optimization flags when compiling can greatly
reduce the runtime of an executable.
• Each compiler has a different set of options for creating the
fastest executable.
• Often the best compiler options can only be arrived at by
empirical testing and timing of your code.
• A good reference for compiler flags that can be used with
various architectures is the SPEC web site www.spec.org.
• Read the Compiler manpages.
• GNU: -O3 -ffast-math -funroll-loops
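The empirical testing and timing mentioned above can be sketched with a small C helper. This is only an illustration: the names time_call and sample_work are hypothetical and the workload is a stand-in for real code.

```c
#include <time.h>

/* Hypothetical workload used only for illustration:
   adds up i * 0.5 for every i below n. */
double sample_work(long n)
{
    double total = 0.0;
    for (long i = 0; i < n; i++) {
        total += (double)i * 0.5;
    }
    return total;
}

/* Returns the CPU seconds consumed by one call to work(n).
   The value computed by work() is stored through *result so the
   compiler cannot discard the call as dead code. */
double time_call(double (*work)(long), long n, double *result)
{
    clock_t start = clock();
    *result = work(n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building the same source once without optimization and once with the flags listed above, then comparing the values returned by time_call, gives a first comparison of this kind. A single call can be noisy, so repeating the measurement is advisable.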
Optimizing Memory Access
• Memory access is more often the performance
bottleneck than processor speed
• Largest potential for performance
improvement
• Order data accesses so that as few as possible
go outside the cache
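As an illustration of cache-friendly access, the sketch below sums a matrix stored in a flat array in two loop orders. The function names are hypothetical. C stores arrays in row-major order, so the first version touches consecutive addresses while the second jumps cols elements on every step. Fortran stores arrays in column-major order, so the favorable loop order is reversed there.

```c
#include <stddef.h>

/* Unit stride: the inner loop walks consecutive addresses, so each
   cache line that is loaded gets fully used. */
double sum_row_order(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            total += a[i * cols + j];
        }
    }
    return total;
}

/* Stride of cols elements: the inner loop jumps a whole row ahead on
   every step, so large matrices touch a new cache line each time. */
double sum_column_order(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < rows; i++) {
            total += a[i * cols + j];
        }
    }
    return total;
}
```

Both functions visit the same elements, only in a different order. Any difference in run time on large matrices comes from the access pattern, and its size depends on the machine and should be measured.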
Memory Latencies
• CPU register: 0 cycles
• L1 cache hit: 2-3 cycles
• L1 cache miss satisfied by L2 cache hit: 8-12
cycles
• L2 cache miss satisfied from main memory, no
TLB miss: 75-250 cycles
• TLB miss requiring only reload of the TLB:
~2000 cycles
• TLB miss requiring reload of virtual page – page
fault: hundreds of millions of cycles
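To see how these latencies combine, the expected cost of one memory access can be estimated from cache hit rates. The sketch below is a simplified two-level model that ignores TLB misses and page faults. The function name and the hit rates in the usage note are assumptions for illustration, not measurements.

```c
/* Expected cycles per memory access for a two-level cache.
   l1_hit_rate:   fraction of all accesses that hit in L1.
   l2_hit_rate:   fraction of L1 misses that hit in L2.
   l1_cycles:     cost of an access satisfied by L1.
   l2_cycles:     cost of an access satisfied by L2.
   memory_cycles: cost of an access satisfied from main memory. */
double expected_access_cycles(double l1_hit_rate, double l2_hit_rate,
                              double l1_cycles, double l2_cycles,
                              double memory_cycles)
{
    double l1_miss = 1.0 - l1_hit_rate;
    double l2_miss = 1.0 - l2_hit_rate;
    return l1_hit_rate * l1_cycles
         + l1_miss * l2_hit_rate * l2_cycles
         + l1_miss * l2_miss * memory_cycles;
}
```

With the upper ends of the ranges above (3, 12 and 250 cycles) and assumed hit rates of 0.95 for L1 and 0.80 for L2, the call expected_access_cycles(0.95, 0.80, 3.0, 12.0, 250.0) gives about 5.83 cycles. Only 1% of accesses reach main memory in that case, yet they account for 2.5 of those cycles.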
Other Code Optimizations
• Copy Propagation
• Constant Folding
• Dead Code Removal
• Induction Variable Simplification
• Function Inlining
• Loop Invariant Conditionals
• Variable Renaming
• Loop Invariant Code Motion
• Loop Fusion
• Pushing Loops inside Subroutines
• Loop Index Dependent Conditionals
• Loop Unrolling
• Loop Stride Size
• Floating Point Optimizations
• Faster Algorithms
• External Libraries
• Assembly Code
• Lookup Tables
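Two entries from this list, loop invariant code motion and loop unrolling, can be sketched by hand in C. The function names are hypothetical. An optimizing compiler may already perform both transformations, so the tuned version shows the idea and does not guarantee a speedup.

```c
#include <stddef.h>

/* Straightforward version: scale / norm is recomputed on every
   iteration even though neither value changes inside the loop. */
void scale_naive(double *x, size_t n, double scale, double norm)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] * (scale / norm);
    }
}

/* Same result with the invariant hoisted out of the loop and the
   loop body unrolled by four. */
void scale_tuned(double *x, size_t n, double scale, double norm)
{
    const double factor = scale / norm;  /* loop invariant, computed once */
    size_t i = 0;

    /* Unrolled body: four elements per iteration. */
    for (; i + 3 < n; i += 4) {
        x[i]     *= factor;
        x[i + 1] *= factor;
        x[i + 2] *= factor;
        x[i + 3] *= factor;
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++) {
        x[i] *= factor;
    }
}
```

Timing both versions with and without compiler optimization shows whether the manual transformation still pays off once the compiler has done its own work.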
Code Optimization References
• Software Optimizations for High Performance
Computing by Crawford and Wadleigh
• High Performance Computing by Kevin
Dowd et al
• Performance Optimization for Numerically
Intensive Codes by Goedecker and Hoisie
Timing and Profiling Codes
• Need to know where to focus attention
• “Premature Optimization is the root of all evil”
– Donald Knuth
• The “80-20 rule” – codes generally spend 80% of their
time executing 20% of their instructions
• A flat profile shows how much time your program spent in
each function, and how many times that function was
called.
• A call graph shows, for each function, which functions called
it, which other functions it called, and how many times.
• An annotated source listing is a copy of the program's source
code, labeled with the number of times each line of the
program was executed.
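Why the profile should decide where to focus can be shown with a short calculation. The sketch below applies Amdahl's law, which the slides do not name, to the 80-20 rule. The function name is hypothetical.

```c
/* Overall speedup when a fraction hot_fraction of the run time is made
   factor times faster and the rest of the program is left unchanged. */
double overall_speedup(double hot_fraction, double factor)
{
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / factor);
}
```

Doubling the speed of the code responsible for 80% of the run time gives overall_speedup(0.8, 2.0), about 1.67. Doubling the speed of the remaining 20% gives overall_speedup(0.2, 2.0), about 1.11.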
GNU gprof
• The first step in generating profile information for your program is to
compile and link it with profiling enabled – use the `-pg' option when
you run the compiler. (This is in addition to the options you normally
use.)
• The `-pg' option also works with a command that both compiles and
links:
cc -o myprog myprog.c utils.c -g -pg
• Execute the code in the normal manner (this writes the profile data to gmon.out):
./myprog
• Create the profile with gprof:
gprof myprog > myprog.prof
Profiling on the Beowulf Cluster
• Compile:
pgf77 -Mprof=func program.f
pgcc -Mprof=func program.c
• Run the code:
– This produces a profile data file called pgprof.out.
• View the execution profile:
– pgprof pgprof.out
pgprof (without X Windows)
Loading....
Datafile : pgprof.out
Processes : 1
pgprof> print
                 Time/                          Function
   Calls       Call(%)    Time(%)    Cost(%)    Name:
-----------------------------------------------------------------------
 4100500          0.00      23.43         23    lxi (cdnz3d.f:1632)
 4100500          0.00      21.90         22    damping (cdnz3d.f:2319)
 4100500          0.00      21.87         22    leta (cdnz3d.f:1790)
 4100500          0.00      11.68         12    lzeta (cdnz3d.f:1947)
 4100500          0.00      11.24         33    sum (cdnz3d.f:2107)
     250          0.02       5.99         97    page (cdnz3d.f:1527)
     250          0.01       2.79          3    tmstep (cdnz3d.f:678)
pgprof> quit
Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design,
standardize and implement a portable and efficient
API to access the hardware performance monitor
counters found on most modern microprocessors.
• Parallel Tools Consortium project
http://www.ptools.org/
PAPI Counter Interfaces
• PAPI provides three interfaces to the underlying
counter hardware:
1. The low-level interface manages hardware events in
user-defined groups called EventSets.
2. The high-level interface simply provides the ability to
start, stop and read the counters for a specified list of
events.
3. Graphical tools to visualize information.
Parallel Communication Profiling
A significant factor that affects the performance of a parallel application is the
balance between communication and workload.
The challenge of the message passing model is in reducing message traffic over
the interconnection network. Performance analysis tools are needed.
Two such tools:
• VAMPIR (http://www.pallas.com)
uses the profile extensions to MPI and permits analysis of the message events
where data is transmitted between processors during execution of a parallel
program. It has a user interface with zooming and filtering.
• PARAVER (http://www.cepba.upc.es/)
was developed to respond to the basic need to have a qualitative perception of the
application behavior by visual inspection and then to be able to focus on the
detailed quantitative analysis of the problems.