Performance Analysis Tools CS267, March 30, 2010 Karl Fuerlinger [email protected] With slides from David Skinner, Sameer Shende, Shirley Moore, Bernd Mohr, Felix Wolf, Hans Christian Hoppe.
Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 2 Motivation Performance analysis is important – For HPC: computer systems are large investments • Procurement: O($40 million) • Operational costs: ~$5 million per year • Power: 1 MW-year ~$1 million – Goals: • Solve larger problems (new science) • Solve problems faster (turn-around time) • Improve error bounds on solutions (confidence) Karl Fuerlinger CS267 - Performance Analysis Tools | 3 Concepts and Definitions The typical performance optimization cycle: Code Development -> Functionally complete and correct program -> Instrumentation -> Measure -> Analyze -> Modify / Tune -> Complete, correct and well-performing program -> Usage / Production Karl Fuerlinger CS267 - Performance Analysis Tools | 4 Instrumentation Instrumentation := adding measurement probes to the code in order to observe its execution Can be done on several levels, with different techniques for different levels Different overheads and levels of accuracy with each technique (Figure: instrumentation levels – user-level abstractions / problem domain; source code via preprocessor or compiler instrumentation; object code and libraries via linker instrumentation; executable / runtime image, OS, and VM instrumentation; all producing performance data at run time.) No application instrumentation needed: run in a simulator. E.g., Valgrind, SIMICS, etc.
but simulation speed is an issue. Karl Fuerlinger CS267 - Performance Analysis Tools | 5 Instrumentation – Examples (1) Library Instrumentation: MPI library interposition – All MPI functions are available under two names: MPI_Xxx and PMPI_Xxx – The MPI_Xxx symbols are weak and can be overridden by an interposition library – Measurement code in the interposition library measures begin, end, transmitted data, etc., and calls the corresponding PMPI routine – Not all MPI functions need to be instrumented Karl Fuerlinger CS267 - Performance Analysis Tools | 6 Instrumentation – Examples (2) Preprocessor Instrumentation – Example: Instrumenting OpenMP constructs with Opari – Preprocessor operation: Original source code -> Preprocessor -> Modified (instrumented) source code – Example: Instrumentation of a parallel region (instrumentation added by Opari):
POMP_Parallel_fork [master]
#pragma omp parallel
{
  POMP_Parallel_begin [team]
  /* user code in parallel region */
  POMP_Barrier_enter [team]
  #pragma omp barrier
  POMP_Barrier_exit [team]
  POMP_Parallel_end [team]
}
POMP_Parallel_join [master]
This approach is used for OpenMP instrumentation by most vendor-independent tools. Examples: TAU/Kojak/Scalasca/ompP Karl Fuerlinger CS267 - Performance Analysis Tools | 7 Instrumentation – Examples (3) Source code instrumentation – User-added time measurement, etc. (e.g., printf(), gettimeofday()) – Think twice before you roll your own solution; many tools expose mechanisms for source code instrumentation in addition to automatic instrumentation mechanisms Instrument program phases: • Initialization • main loop iteration 1,2,3,4,...
• data post-processing – Pragma and pre-processor based: #pragma pomp inst begin(foo) // application code #pragma pomp inst end(foo) – Macro / function call based: ELG_USER_START("name"); // application code ELG_USER_END("name"); Karl Fuerlinger CS267 - Performance Analysis Tools | 8 Instrumentation – Examples (4) Compiler Instrumentation – Many compilers can instrument functions automatically – GNU compiler flag: -finstrument-functions – Automatically calls functions on function entry/exit that a tool can capture – Not standardized across compilers, often undocumented flags, sometimes not available at all – GNU compiler example:
void __cyg_profile_func_enter(void *this_fn, void *call_site) { /* called on function entry */ }
void __cyg_profile_func_exit(void *this_fn, void *call_site) { /* called just before returning from function */ }
Karl Fuerlinger CS267 - Performance Analysis Tools | 9 Instrumentation – Examples (5) Binary Runtime Instrumentation – Dynamic patching while the program executes – Example: Paradyn tool, Dyninst API Base trampolines / mini trampolines (figure by Skylar Byrd Rampersaud): – Base trampolines handle storing the current state of the program so instrumentations do not affect execution – Mini trampolines are the machine-specific realizations of predicates and primitives – One base trampoline may handle many mini-trampolines, but a base trampoline is needed for every instrumentation point Binary instrumentation is difficult – have to deal with: • Compiler optimizations • Branch delay slots • Different sizes of instructions for x86 (may increase the number of instructions that have to be relocated) • Creating and inserting mini trampolines somewhere in program (at end?) • Limited-range jumps may complicate this PIN: open-source dynamic binary instrumenter from Intel Karl Fuerlinger CS267 - Performance Analysis Tools | 10 Measurement Profiling vs.
Tracing Profiling – Summary statistics of performance metrics • Number of times a routine was invoked • Exclusive, inclusive time • Hardware performance counters • Number of child routines invoked, etc. • Structure of invocations (call-trees/call-graphs) • Memory, message communication sizes Tracing – When and where events took place along a global timeline • Time-stamped log of events • Message communication events (sends/receives) are tracked • Shows when and from/to where messages were sent • Large volume of performance data generated usually leads to more perturbation in the program Karl Fuerlinger CS267 - Performance Analysis Tools | 11 Measurement: Profiling Profiling – Helps to expose performance bottlenecks and hotspots – 80/20 rule or Pareto principle: often 80% of the execution time is spent in 20% of your application – Optimize what matters, don't waste time optimizing things that have negligible overall influence on performance Implementation – Sampling: periodic OS interrupts or hardware counter traps • Build a histogram of sampled program counter (PC) values • Hotspots will show up as regions with many hits – Measurement: direct insertion of measurement code • Measure at start and end of regions of interest, compute difference Karl Fuerlinger CS267 - Performance Analysis Tools | 12 Profiling: Inclusive vs. Exclusive Time
int main( ) {        /* takes 100 secs */
  f1();              /* takes 20 secs */
  /* other work */
  f2();              /* takes 50 secs */
  f1();              /* takes 20 secs */
  /* other work */
}
– Inclusive time for main: 100 secs – Exclusive time for main: 100-20-50-20 = 10 secs – Exclusive time sometimes called "self" time – Similar definitions for inclusive/exclusive time for f1() and f2() – Similar for other metrics, such as hardware performance counters, etc. Karl Fuerlinger CS267 - Performance Analysis Tools | 13 Tracing Example: Instrumentation, Monitor, Trace Event definitions Process A:
void master {
  trace(ENTER, 1);
  ...
  trace(SEND, B);
  send(B, tag, buf);
  ...
  trace(EXIT, 1);
}
Process B:
void worker {
  trace(ENTER, 2);
  ...
  recv(A, tag, buf);
  trace(RECV, A);
  ...
  trace(EXIT, 2);
}
Event definitions: 1 = master, 2 = worker, 3 = ... The MONITOR merges timestamped event records (timestamp, location, event, context):
58 A ENTER 1
60 B ENTER 2
62 A SEND B
64 A EXIT 1
68 B RECV A
69 B EXIT 2
Karl Fuerlinger CS267 - Performance Analysis Tools | 14 Tracing: Timeline Visualization The same trace (1 = master, 2 = worker) rendered as a timeline: processes A and B on the vertical axis, time 58-70 on the horizontal axis, with an arrow from A to B for the message. Karl Fuerlinger CS267 - Performance Analysis Tools | 15 Measurement: Tracing Tracing – Recording of information about significant points (events) during program execution • entering/exiting code region (function, loop, block, …) • thread/process interactions (e.g., send/receive message) – Save information in event record • timestamp • CPU identifier, thread identifier • Event type and event-specific information – Event trace is a time-sequenced stream of event records – Can be used to reconstruct dynamic program behavior – Typically requires code instrumentation Karl Fuerlinger CS267 - Performance Analysis Tools | 16 Performance Data Analysis Draw conclusions from measured performance data Manual analysis – Visualization – Interactive exploration – Statistical analysis – Modeling Automated analysis – Try to cope with huge amounts of performance data by automation – Examples: Paradyn, KOJAK, Scalasca, Periscope Karl Fuerlinger CS267 - Performance Analysis Tools | 17 Trace File Visualization Vampir: timeline view – Similar other tools: Jumpshot, Paraver Karl Fuerlinger CS267 - Performance Analysis Tools | 18 Trace File Visualization Vampir/IPM: message communication statistics Karl Fuerlinger CS267 - Performance Analysis Tools | 19 3D performance data exploration Paraprof viewer (from the TAU toolset) Karl Fuerlinger CS267 - Performance Analysis Tools | 20 Automated Performance Analysis Reason for Automation – Size of systems: tens of thousands of processors
– LLNL Sequoia: 1.6 million cores – Trend to multi-core Large amounts of performance data when tracing – Several gigabytes or even terabytes Not all programmers are performance experts – Scientists want to focus on their domain – Need to keep up with new machines Automation can solve some of these issues Karl Fuerlinger CS267 - Performance Analysis Tools | 21 Automation - Example "Late sender" pattern: one process sits in a blocking receive while the matching send on the other process starts late. This pattern can be detected automatically by analyzing the trace Karl Fuerlinger CS267 - Performance Analysis Tools | 22 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 23 Hardware Performance Counters Specialized hardware registers to measure the performance of various aspects of a microprocessor Originally used for hardware verification purposes Can provide insight into: – Cache behavior – Branching behavior – Memory and resource contention and access patterns – Pipeline stalls – Floating point efficiency – Instructions per cycle Counters vs. events – Usually a large number of countable events (several hundred) – On a small number of counters (4-18) – PAPI handles multiplexing if required Karl Fuerlinger CS267 - Performance Analysis Tools | 24 What is PAPI Middleware that provides a consistent and efficient programming interface to the performance counter hardware found in most major microprocessors. Countable events are defined in two ways: – Platform-neutral Preset Events (e.g., PAPI_TOT_INS) – Platform-dependent Native Events (e.g., L3_CACHE_MISS) Preset Events can be derived from multiple Native Events (e.g.
PAPI_L1_TCM might be the sum of L1 Data Misses and L1 Instruction Misses on a given platform) Preset events are defined in a best-effort way – No guarantee of portable semantics – Figuring out what a counter actually counts, and whether it does so correctly, can be hairy Karl Fuerlinger CS267 - Performance Analysis Tools | 25 PAPI Hardware Events Preset Events – Standard set of over 100 events for application performance tuning – No standardization of the exact definitions – Mapped to either single or linear combinations of native events on each platform – Use the papi_avail utility to see what preset events are available on a given platform Native Events – Any event countable by the CPU – Same interface as for preset events – Use the papi_native_avail utility to see all available native events – Use the papi_event_chooser utility to select a compatible set of events Karl Fuerlinger CS267 - Performance Analysis Tools | 26 PAPI Counter Interfaces PAPI provides 3 interfaces to the underlying counter hardware: – A low-level API manages hardware events (preset and native) in user-defined groups called EventSets. Meant for experienced application programmers wanting fine-grained measurements. – A high-level API provides the ability to start, stop and read the counters for a specified list of events (preset only). Meant for programmers wanting simple event measurements. – Graphical and end-user tools provide facile data collection and visualization.
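Presets and the high-level rate calls are, in the end, arithmetic over raw counts: a preset like PAPI_L1_TCM may be synthesized from two native events, and rates like IPC from two counts plus a time. A minimal sketch of that arithmetic (the helper names are hypothetical, for illustration only; they are not part of the PAPI API):

```c
/* Hypothetical helpers showing how PAPI-style derived metrics
   are computed from raw counter values.  Not PAPI code. */

/* A preset like PAPI_L1_TCM may be the sum of two native events. */
long long derived_l1_tcm(long long l1_data_misses, long long l1_inst_misses) {
    return l1_data_misses + l1_inst_misses;
}

/* Instructions per cycle, the metric PAPI_ipc() reports. */
double derived_ipc(long long instructions, long long cycles) {
    return cycles > 0 ? (double)instructions / (double)cycles : 0.0;
}

/* Mflop/s from a floating-point operation count and elapsed seconds,
   the metric PAPI_flops() reports. */
double derived_mflops(long long flpops, double seconds) {
    return seconds > 0.0 ? (double)flpops / (seconds * 1.0e6) : 0.0;
}
```

This also shows why preset semantics are "best effort": the quality of a derived value is only as good as the mapping from native events on the given platform.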
(Layer diagram: 3rd-party and GUI tools, the low-level user API, and the high-level user API sit on the PAPI portable layer; beneath it are the PAPI hardware-specific layer, a kernel extension, the operating system, and the performance counter hardware.) Karl Fuerlinger CS267 - Performance Analysis Tools | 27 PAPI High Level Calls
PAPI_num_counters() – get the number of hardware counters available on the system
PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips) – simplified call to get Mflips/s (floating point instruction rate), real and processor time
PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops) – simplified call to get Mflops/s (floating point operation rate), real and processor time
PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc) – gets instructions per cycle, real and processor time
PAPI_start_counters(int *events, int array_len) – start counting hardware events
PAPI_read_counters(long long *values, int array_len) – copy current counts to array and reset counters
PAPI_accum_counters(long long *values, int array_len) – add current counts to array and reset counters
PAPI_stop_counters(long long *values, int array_len) – stop counters and return current counts
Karl Fuerlinger CS267 - Performance Analysis Tools | 28 PAPI Example Low Level API Usage
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_FP_OPS, PAPI_TOT_CYC};
int EventSet = PAPI_NULL;
long long values[NUM_EVENTS];
int retval;
/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add FLOPs and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);
do_work(); /* What we want to monitor */
/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
Karl Fuerlinger CS267 - Performance Analysis Tools | 29 Using PAPI through tools You can use PAPI directly in your application, but most people use
it through tools Tool might have a predefined set of counters, or let you select counters through a configuration file, environment variable, etc. Tools using PAPI – TAU (UO) – PerfSuite (NCSA) – HPCToolkit (Rice) – KOJAK, Scalasca (FZ Juelich, UTK) – Open|Speedshop (SGI) – ompP (UCB) – IPM (LBNL) Karl Fuerlinger CS267 - Performance Analysis Tools | 30 Component PAPI Design – Re-implementation of PAPI with support for multiple monitoring domains (Layer diagram: low-level, high-level, and developer APIs on the PAPI framework layer; separate PAPI component layers for CPU, network, and thermal counters, each over its own kernel patch, operating system, and performance counter hardware.) Karl Fuerlinger CS267 - Performance Analysis Tools | 31 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 32 OpenMP Performance Analysis with ompP ompP: Profiling tool for OpenMP – Based on source code instrumentation – Independent of the compiler and runtime used – Tested and supported: Linux, Solaris, AIX and Intel, Pathscale, PGI, IBM, gcc, SUN studio compilers – Supports HW counters through PAPI – Uses source code instrumenter Opari from the KOJAK/Scalasca toolset – Available for download (GPL): http://www.ompp-tool.com Automatic instrumentation of OpenMP constructs, manual region instrumentation Source Code ompP library Settings (env.
Vars): HW counters, output format, … Executable -> Execution on parallel machine -> Profiling Report Karl Fuerlinger CS267 - Performance Analysis Tools | 33 OpenMP OpenMP – Threads and fork/join based programming model – Worksharing constructs (Figure: master thread forking teams of threads for parallel regions) Characteristics – Directive based (compiler pragmas, comments) – Incremental parallelization approach – Well suited for loop-based parallel programming – Less well suited for irregular parallelism (but tasking is included in version 3.0 of the OpenMP specification) – One of the contending programming paradigms for the "multicore era" Karl Fuerlinger CS267 - Performance Analysis Tools | 34 ompP's Profiling Report Header – Date, time, duration of the run, number of threads, used hardware counters, … Region Overview – Number of OpenMP regions (constructs) and their source-code locations Flat Region Profile – Inclusive times, counts, hardware counter data Callgraph Callgraph Profiles – With inclusive and exclusive times Overhead Analysis Report – Four overhead categories – Per-parallel region breakdown – Absolute times and percentages Karl Fuerlinger CS267 - Performance Analysis Tools | 35 Profiling Data Example profiling data Code:
#pragma omp parallel
{
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Profile:
R00002 main.c (34-37) (default) CRITICAL
TID   execT  execC  bodyT  enterT  exitT  PAPI_TOT_INS
  0    3.00      1   1.00    2.00   0.00          1595
  1    1.00      1   1.00    0.00   0.00          6347
  2    2.00      1   1.00    1.00   0.00          1595
  3    4.00      1   1.00    3.00   0.00          1595
SUM   10.01      4   4.00    6.00   0.00         11132
Components: – Source code location and type of region – Timing data and execution counts, depending on the particular construct – One line per thread, last line sums over all threads – Hardware counter data (if PAPI is available and HW counters are selected) – Data is "exact" (measured, not based on sampling) Karl Fuerlinger CS267 - Performance Analysis Tools | 36 Flat Region Profile (2) Times and counts reported by ompP for various OpenMP constructs Ends with
T: time Ends with C: count Main = enter + body + barr + exit Karl Fuerlinger CS267 - Performance Analysis Tools | 37 Callgraph Callgraph View – 'Callgraph' or 'region stack' of OpenMP constructs – Functions can be included by using Opari's mechanism to instrument user defined regions: #pragma pomp inst begin(…), #pragma pomp inst end(…) Callgraph profile – Similar to flat profile, but with inclusive/exclusive times Example:
main() {
  #pragma omp parallel
  {
    foo1();
    foo2();
  }
}
void foo1() { bar(); }
void foo2() { bar(); }
void bar() {
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Karl Fuerlinger CS267 - Performance Analysis Tools | 38 Callgraph (2) Callgraph display, incl. CPU time:
[APP 4 threads]                                   32.22 (100.0%)
+-R00004 main.c (42-46)               PARALLEL    32.06 (99.50%)
  |-R00001 main.c (19-21) ('foo1')    USERREG     10.02 (31.10%)
  | +-R00003 main.c (33-36) (unnamed) CRITICAL    10.02 (31.10%)
  +-R00002 main.c (26-28) ('foo2')    USERREG     16.03 (49.74%)
    +-R00003 main.c (33-36) (unnamed) CRITICAL    16.03 (49.74%)
Callgraph profiles:
[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
TID  execT/I  execT/E  execC
  0     1.00     0.00      1
  1     3.00     0.00      1
  2     2.00     0.00      1
  3     4.00     0.00      1
SUM    10.01     0.00      4
[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
[=03] R00003 main.c (33-36) (unnamed) CRITICAL
TID  execT  execC  bodyT/I  bodyT/E  enterT  exitT
  0   1.00      1     1.00     1.00    0.00   0.00
  1   3.00      1     1.00     1.00    2.00   0.00
  2   2.00      1     1.00     1.00    1.00   0.00
  3   4.00      1     1.00     1.00    3.00   0.00
SUM  10.01      4     4.00     4.00    6.00   0.00
Karl Fuerlinger CS267 - Performance Analysis Tools | 39 Overhead Analysis (1) Certain timing categories reported by ompP can be classified as overheads: – Example: exitBarT: time wasted by threads idling at the exit barrier of work-sharing constructs.
Reason is most likely an imbalanced amount of work Four overhead categories are defined in ompP: – Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region – Synchronization: overhead that arises due to threads having to synchronize their activity, e.g. a barrier call – Limited Parallelism: idle threads due to not enough parallelism being exposed by the program – Thread management: overhead for the creation and destruction of threads, and for signaling critical sections, locks as available Karl Fuerlinger CS267 - Performance Analysis Tools | 40 Overhead Analysis (2) S: Synchronization overhead I: Imbalance overhead M: Thread management overhead L: Limited Parallelism overhead Karl Fuerlinger CS267 - Performance Analysis Tools | 41 ompP's Overhead Analysis Report
-------------------------------------------------------------------------
ompP Overhead Analysis Report
-------------------------------------------------------------------------
Total runtime (wallclock)  : 172.64 sec [32 threads]
Number of parallel regions : 12
Parallel coverage          : 134.83 sec (78.10%)

Parallel regions sorted by wallclock time:
        Type    Location            Wallclock (%)
R00011  PARALL  mgrid.F (360-384)   55.75 (32.29)
R00019  PARALL  mgrid.F (403-427)   23.02 (13.34)
R00009  PARALL  mgrid.F (204-217)   11.94 ( 6.92)
...
SUM                                134.83 (78.10)

Overheads wrt. each individual parallel region (Total = wallclock time x number of threads; overhead percentages wrt. this particular region):
          Total    Ovhds (%)       = Synch (%)   + Imbal (%)      + Limpar (%)   + Mgmt (%)
R00011  1783.95  337.26 (18.91)  0.00 ( 0.00)  305.75 (17.14)  0.00 ( 0.00)   31.51 ( 1.77)
R00019   736.80  129.95 (17.64)  0.00 ( 0.00)  104.28 (14.15)  0.00 ( 0.00)   25.66 ( 3.48)
R00009   382.15  183.14 (47.92)  0.00 ( 0.00)   96.47 (25.24)  0.00 ( 0.00)   86.67 (22.68)
R00015   276.11   68.85 (24.94)  0.00 ( 0.00)   51.15 (18.52)  0.00 ( 0.00)   17.70 ( 6.41)
...

Overheads wrt. whole program (overhead percentages wrt. whole program):
          Total    Ovhds (%)       = Synch (%)   + Imbal (%)      + Limpar (%)   + Mgmt (%)
R00011  1783.95  337.26 ( 6.10)  0.00 ( 0.00)  305.75 ( 5.53)  0.00 ( 0.00)   31.51 ( 0.57)
R00009   382.15  183.14 ( 3.32)  0.00 ( 0.00)   96.47 ( 1.75)  0.00 ( 0.00)   86.67 ( 1.57)
R00005   264.16  164.90 ( 2.98)  0.00 ( 0.00)   63.92 ( 1.16)  0.00 ( 0.00)  100.98 ( 1.83)
R00007   230.63  151.91 ( 2.75)  0.00 ( 0.00)   68.58 ( 1.24)  0.00 ( 0.00)   83.33 ( 1.51)
...
SUM     4314.62 1277.89 (23.13)  0.00 ( 0.00)  872.92 (15.80)  0.00 ( 0.00)  404.97 ( 7.33)
Karl Fuerlinger CS267 - Performance Analysis Tools | 42 OpenMP Scalability Analysis Methodology – Classify execution time into "Work" and four overhead categories: "Thread Management", "Limited Parallelism", "Imbalance", "Synchronization" – Analyze how overheads behave for increasing thread counts – Graphs show accumulated runtime over all threads for fixed workload (strong scaling) – Horizontal line = perfect (linear) scalability (Figure: accumulated time and wallclock time vs. thread count for imperfect, perfect (linear), and super-linear scaling.) Karl Fuerlinger CS267 - Performance Analysis Tools | 43 SPEC OpenMP Benchmarks (1) Application 314.mgrid_m – Scales relatively poorly; the application has 12 parallel loops, all contribute with increasingly severe load imbalance – Markedly smaller load imbalance for thread counts of 32 and 16.
Only three loops show this behavior – In all three cases the iteration count is always a power of two (2 to 256); hence thread counts which are not a power of two exhibit more load imbalance Karl Fuerlinger CS267 - Performance Analysis Tools | 44 SPEC OpenMP Benchmarks (2) Application 316.applu – Super-linear speedup – Only one parallel region (ssor.f 138-209) shows super-linear speedup, contributes 80% of accumulated total execution time – Most likely reason for super-linear speedup: increased overall cache size (Chart: L3_MISSES vs. number of threads, 2 to 32.) Karl Fuerlinger CS267 - Performance Analysis Tools | 45 SPEC OpenMP Benchmarks (3) Application 313.swim – Dominating source of inefficiency is thread management overhead – Main source: reduction of three scalar variables in a small parallel loop in swim.f 116-126 – At 128 threads more than 6 percent of the total accumulated runtime is spent in the reduction operation – Time for the reduction operation is larger than time spent in the body of the parallel region Karl Fuerlinger CS267 - Performance Analysis Tools | 46 SPEC OpenMP Benchmarks (4) Application 318.galgel – Scales very poorly; a large fraction of the overhead is not accounted for by ompP (most likely memory access latency, cache conflicts, false sharing) – lapack.f90 5081-5092 contributes significantly to the bad scaling • accumulated CPU time increases from 107.9 (2 threads) to 1349.1 seconds (32 threads) • the 32-thread version is only 22% faster than the 2-thread version (wall-clock time) • the 32-thread version's parallel efficiency is only approx. 8% (Charts: whole application vs. region lapack.f90 5081-5092.) Karl Fuerlinger CS267 - Performance Analysis Tools | 47 Incremental Profiling (1) Profiling vs.
Tracing – Profiling: • • • – Tracing: • • • • low overhead small amounts of data easy to comprehend, even as simple ASCII text Large quantities of data hard to comprehend manually allows temporal phenomena to be explained causal relationship of events are preserved Idea: Combine advantages of profiling and tracing – Add a temporal dimension to profiling-type performance data – See what happens during the execution without capturing full traces – Manual interpretation becomes harder since a new dimension is added to the performance data Karl Fuerlinger CS267 - Performance Analysis Tools | 48 Incremental Profiling (2) Implementation: – Capture and dump profiling reports not only at the end of the execution but several times while the application executes – Analyze how profiling reports change over time – Capture points need not be regular “One-shot” Profiling time Incremental Profiling Karl Fuerlinger CS267 - Performance Analysis Tools | 49 Incremental Profiling (3) Possible triggers for capturing profiles: – Timer-based, fixed: capture profiles in regular, uniform intervals: predictable storage requirements (depends only on duration of program run, size of dataset). – Timer-based, adaptive: Adapt the capture rate to the behavior of the application: dump often if application behavior changes, decrease rate if application behavior stays the same – Counter overflow based: Dump a profile if a hardware counter overflows. 
Interesting for floating point intensive applications – User-added: expose an API for dumping profiles to the user, aligned to outer loop iterations or phase boundaries Karl Fuerlinger CS267 - Performance Analysis Tools | 50 Incremental Profiling Trigger currently implemented in ompP: – Capture profiles in regular intervals – Timer signal is registered and delivered to profiler – Profiling data up to capture point stored to memory buffer – Dumped as individual profiling reports at the end of program execution – Perl scripts to analyze reports and generate graphs Experiments – 1 second regular dump interval – SPEC OpenMP benchmark suite • Medium variant, 11 applications – 32 CPU SGI Altix machine • Itanium-2 processors with 1.6 GHz and 6 MB L3 cache • Used in batch mode Karl Fuerlinger CS267 - Performance Analysis Tools | 51 Incremental Profiling: Data Views (2) Overheads over time – See how overheads change over the application run – How is each Δt (1 sec) spent: on work or on one of the overhead classes? – Either for the whole program or for a specific parallel region – Total incurred overhead = integral under this function Application: 328.fma3d_m – Initialization in a critical section, effectively serializing the execution for approx. 15 seconds.
Overhead = 31/32 ≈ 96.9% Karl Fuerlinger CS267 - Performance Analysis Tools | 52 Incremental Profiling Performance counter heatmaps – x-axis: time, y-axis: thread ID – Color: number of hardware counter events observed during the sampling period – Application "applu", medium-sized variant, counter: LOADS_RETIRED – Visible phenomena: iterative behavior, thread grouping (pairs) Karl Fuerlinger CS267 - Performance Analysis Tools | 53 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 54 IPM – Integrated Performance Monitoring IPM provides a performance profile of a job – "Flip of a switch" operation – http://ipm-hpc.sourceforge.net (Figure: Input_123 + Job_123 -> IPM -> Output_123 + Profile_123.) Karl Fuerlinger CS267 - Performance Analysis Tools | 55 IPM: Design Goals Provide high-level performance profile – Event inventory: which events happened and how much time did they take – How much time in communication operations, how much time in OpenMP parallel regions, how much time in file I/O – Less focus on drill-down into the application than other tools Efficiency – Constant memory footprint (approx. 1-2 MB per MPI rank) – Monitoring data is kept in a hash table, avoids dynamic memory allocation – Low CPU overhead: 1-2% Ease of use – HTML- or ASCII-based output format – Flip of a switch, no recompilation, no user instrumentation – Portability Karl Fuerlinger CS267 - Performance Analysis Tools | 56 IPM: Methodology MPI_Init() – Initialize monitoring environment, allocate memory For each MPI call – Compute hash key from • Type of call (send/recv/bcast/...) • Buffer size (in bytes) • Communication partner rank • Call-site, region or phase identifier, ...
– Store / update value in hash table with timing data • Number of invocations • Minimum duration, maximum duration, summed time MPI_Finalize() – Aggregate, report to stdout, write XML log Karl Fuerlinger CS267 - Performance Analysis Tools | 57 Using IPM: Basics Do "module load ipm", then run normally Upon completion you get
##IPMv0.85################################################################
#
# command : ../exe/pmemd -O -c inpcrd -o res (completed)
# host    : s05405             mpi_tasks : 64 on 4 nodes
# start   : 02/22/05/10:03:55  wallclock : 24.278400 sec
# stop    : 02/22/05/10:04:17  %comm     : 32.43
# gbytes  : 2.57604e+00 total  gflop/sec : 2.04615e+00 total
#
###########################################################################
Maybe that's enough. If so you're done. Have a nice day. Karl Fuerlinger CS267 - Performance Analysis Tools | 58 Want more detail? IPM_REPORT=full
##IPMv0.85#####################################################################
#
# command : ../exe/pmemd -O -c inpcrd -o res (completed)
# host    : s05405             mpi_tasks : 64 on 4 nodes
# start   : 02/22/05/10:03:55  wallclock : 24.278400 sec
# stop    : 02/22/05/10:04:17  %comm     : 32.43
# gbytes  : 2.57604e+00 total  gflop/sec : 2.04615e+00 total
#
#             [total]      <avg>        min          max
# wallclock   1373.67      21.4636      21.1087      24.2784
# user        936.95       14.6398      12.68        20.3
# system      227.7        3.55781      1.51         5
# mpi         503.853      7.8727       4.2293       9.13725
# %comm                    32.4268      17.42        41.407
# gflop/sec   2.04614      0.0319709    0.02724      0.04041
# gbytes      2.57604      0.0402507    0.0399284    0.0408173
# gbytes_tx   0.665125     0.0103926    1.09673e-05  0.0368981
# gbyte_rx    0.659763     0.0103088    9.83477e-07  0.0417372
#
Karl Fuerlinger CS267 - Performance Analysis Tools | 59 Want more detail?
IPM_REPORT=full

# PM_CYC       3.00519e+11 4.69561e+09 4.50223e+09 5.83342e+09
# PM_FPU0_CMPL 2.45263e+10 3.83223e+08  3.3396e+08 5.12702e+08
# PM_FPU1_CMPL 1.48426e+10 2.31916e+08 1.90704e+08  2.8053e+08
# PM_FPU_FMA   1.03083e+10 1.61067e+08 1.36815e+08 1.96841e+08
# PM_INST_CMPL 3.33597e+11 5.21245e+09 4.33725e+09 6.44214e+09
# PM_LD_CMPL   1.03239e+11 1.61311e+09 1.29033e+09 1.84128e+09
# PM_ST_CMPL   7.19365e+10 1.12401e+09 8.77684e+08 1.29017e+09
# PM_TLB_MISS  1.67892e+08 2.62332e+06 1.16104e+06 2.36664e+07
#
#                   [time]     [calls]      <%mpi>     <%wall>
# MPI_Bcast        352.365        2816       69.93       22.68
# MPI_Waitany      81.0002      185729       16.08        5.21
# MPI_Allreduce    38.6718        5184        7.68        2.49
# MPI_Allgatherv   14.7468         448        2.93        0.95
# MPI_Isend        12.9071      185729        2.56        0.83
# MPI_Gatherv      2.06443         128        0.41        0.13
# MPI_Irecv          1.349      185729        0.27        0.09
# MPI_Waitall     0.606749        8064        0.12        0.04
# MPI_Gather     0.0942596         192        0.02        0.01
###############################################################################

IPM: XML log files
There’s a lot more information in the logfile than you get on stdout. A logfile is written that contains the hash table, switch traffic, memory usage, executable information, ...
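The percentage columns in the report above can be reproduced from the totals it prints. A small sketch, assuming (consistently with the numbers shown) that <%mpi> is a call's summed time over the total MPI time and <%wall> is that time over wallclock × ntasks:

```python
# Reproduce IPM's <%mpi> and <%wall> columns from the report's totals.
mpi_total = 503.853    # total MPI time over all ranks (from the report)
wallclock = 24.278400  # wallclock in seconds (from the report header)
ntasks    = 64         # mpi_tasks (from the report header)

def pct_mpi(t):
    return round(100 * t / mpi_total, 2)

def pct_wall(t):
    return round(100 * t / (wallclock * ntasks), 2)

# MPI_Bcast: 352.365 s summed across ranks
print(pct_mpi(352.365), pct_wall(352.365))  # 69.93 22.68, as in the report
```

The same formulas recover the MPI_Waitany row (16.08 and 5.21), which suggests this reading of the columns is the intended one.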
Parallelism in writing the log (when possible)
The IPM logs are durable performance profiles serving
– HPC center production needs: https://www.nersc.gov/nusers/status/llsum/ and http://www.sdsc.edu/user_services/top/ipm/
– HPC research: ipm_parse renders txt and html, http://www.nersc.gov/projects/ipm/ex3/
– Your own XML-consuming entity, feed, or process

Message Sizes: CAM, 336-way
– per MPI call
– per MPI call & buffer size

Scalability: Required
32K-task AMR code

More than a pretty picture

Application Assessment with IPM
Provide high-level performance numbers with small overhead
– To get an initial read on application runtimes
– For allocation/reporting
– To check the performance weather on systems with high variability
What’s going on overall in my code?
– How much comp, comm, I/O?
– Where to start with optimization?
How is my load balance?
– Domain decomposition vs. concurrency (M work on N tasks)

When to reach for another tool
– Full application tracing
– Looking for hotspots at the statement level in code
– Data-structure-level detail
– Automated performance feedback

What’s wrong here?

Is MPI_Barrier time bad? Probably. Is it avoidable?
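Feeding the XML log into "your own XML-consuming process" can be as simple as the sketch below. Note the tag and attribute names here are invented for illustration only and do not match IPM's actual schema; the point is the pattern of pulling per-function totals out of a structured log.

```python
# Hypothetical consumer of an IPM-style XML log (schema invented here).
import xml.etree.ElementTree as ET

log = """<task mpi_tasks="64">
  <func name="MPI_Bcast"   count="2816"   time="352.365"/>
  <func name="MPI_Waitany" count="185729" time="81.0002"/>
</task>"""

root = ET.fromstring(log)
# Map function name -> summed time, then find the dominant call.
times = {f.get("name"): float(f.get("time")) for f in root.iter("func")}
top = max(times, key=times.get)
print(top, times[top])  # MPI_Bcast 352.365
```

For real logs you would use ipm_parse first, or inspect the file to learn the actual element names before writing such a consumer.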
– The stray / unknown / debug barrier
– Barriers used for I/O ordering

Summary
Performance monitoring concepts
– Instrument, measure, analyze
– Profiling/tracing, sampling, direct measurement
Tools
– PAPI, ompP, and IPM as examples
Lots of other tools
– Vendor tools: Cray PAT, Sun Studio, Intel Thread Profiler, VTune, PTU, ...
– Portable tools: TAU, PerfSuite, Paradyn, HPCToolkit, Kojak, Scalasca, Vampir, oprofile, gprof, ...

Documentation, Manuals, User Guides
– PAPI: http://icl.cs.utk.edu/papi/
– ompP: http://www.ompp-tool.com
– IPM: http://ipm-hpc.sourceforge.net/
– TAU: http://www.cs.uoregon.edu/research/tau/
– VAMPIR: http://www.vampir-ng.de/
– Scalasca: http://www.scalasca.org

Thank you for your attention!

BACKUP SLIDES

Vampir – Trace Visualization

Vampir overview statistics
Aggregated profiling information
– Execution time
– Number of calls
This profiling information is computed from the trace
– Change the selection in the main timeline window
– Inclusive or exclusive of called routines

Timeline display
To zoom, mark a region with the mouse

Timeline display – message details
– Message information: click on a message line
– Message send op / message receive op

Communication statistics
Message statistics for each process/node pair:
– Byte and message count
– min/max/avg message length, bandwidth

Message histograms
Message statistics by length, tag, or communicator
– Byte and message count
– Min/max/avg bandwidth

Collective operations
For
each process: mark the operation locally
– Start of op / stop of op
– Data being sent / data being received
– Connect start/stop points by connection lines

Activity chart
Profiling information for all processes

Process-local displays
– Timeline (showing calling levels)
– Activity chart
– Calling tree (showing number of calls)

Effects of zooming
Select one iteration
– Updated message statistics
– Updated summary

KOJAK / Scalasca

Basic Idea
“Traditional” tool:
– Huge amount of measurement data
– For non-standard / tricky cases (10%)
– For expert users
Automatic tool:
– Simple: 1 screen + 2 commands + 3 panes
– Relevant problems and data
– For standard cases (90%?!)
– For “normal” users; starting point for experts
More productivity for the performance analysis process!

MPI-1 Pattern: Wait at Barrier
Time spent in front of an MPI synchronizing operation such as a barrier

MPI-1 Pattern: Late Sender / Receiver
[Timeline diagrams (location vs. time) of MPI_Send/MPI_Recv and MPI_Send/MPI_Irecv/MPI_Wait pairs]
Late Sender: time lost waiting, caused by a blocking receive operation posted earlier than the corresponding send operation
Late Receiver: time lost waiting in a blocking send operation until the corresponding receive operation is called

The result display answers:
– Performance property: what problem?
– Region tree: where in the source code? In what context?
– Color coding: how severe is the problem?
– Location: how is the problem distributed across the machine?
KOJAK: sPPM run on 1792 (8x16x14) PEs
Topology display
– Shows the distribution of a pattern over the HW topology
– Easily scales to even larger systems

TAU

TAU Parallel Performance System
http://www.cs.uoregon.edu/research/tau/
Multi-level performance instrumentation
– Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
– Computer system architectures and operating systems
– Different programming languages and compilers
Support for multiple parallel programming paradigms
– Multi-threading, message passing, mixed-mode, hybrid
Integration in complex software, systems, applications

ParaProf – 3D Scatterplot (Miranda)
– Each point is a “thread” of execution
– A total of four metrics shown in relation
– ParaVis 3D profile visualization library (JOGL)
– 32k processors

ParaProf – 3D Scatterplot (SWEEP3D CUBE)

PerfExplorer – Cluster Analysis
– Four significant events automatically selected (from 16K processors)
– Clusters and correlations are visible

PerfExplorer – Correlation Analysis (Flash)
Describes the strength and direction of a linear relationship between two variables (events) in the data

PerfExplorer – Correlation Analysis (Flash)
–0.995 indicates a strong, negative relationship: as CALC_CUT_BLOCK_CONTRIBUTIONS() increases in execution time, MPI_Barrier() decreases
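The correlation coefficient PerfExplorer reports is the ordinary Pearson r. A short sketch with made-up per-rank timings (the data below is hypothetical, chosen only to mimic the slide's near −1 result, where barrier time shrinks as compute time grows):

```python
# Pearson correlation between two events' per-rank execution times.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

calc_time    = [10.0, 12.0, 15.0, 18.0, 20.0]  # hypothetical compute times
barrier_time = [9.9,  8.1,  5.2,  2.0,  0.1]   # shrinks as compute grows

r = pearson(calc_time, barrier_time)
print(round(r, 3))  # close to -1: strong negative linear relationship
```

Ranks that spend longer computing spend correspondingly less time in the barrier, which is exactly the load-imbalance signature the Flash slide describes.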