Performance Analysis Tools CS267, March 30, 2010 Karl Fuerlinger [email protected] With slides from David Skinner, Sameer Shende, Shirley Moore, Bernd Mohr, Felix Wolf, Hans Christian Hoppe.
Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 2 Motivation Performance analysis is important – For HPC: computer systems are large investments • Procurement: O($40 million) • Operational costs: ~$5 million per year • Power: 1 MW-year ~$1 million – Goals: • Solve larger problems (new science) • Solve problems faster (turn-around time) • Improve error bounds on solutions (confidence) Karl Fuerlinger CS267 - Performance Analysis Tools | 3 Concepts and Definitions The typical performance optimization cycle: Code Development -> Functionally complete and correct program -> Instrumentation -> Measure -> Analyze -> Modify / Tune -> Complete, correct and well-performing program -> Usage / Production Karl Fuerlinger CS267 - Performance Analysis Tools | 4 Instrumentation Instrumentation := adding measurement probes to the code in order to observe its execution Can be done on several levels, with different techniques for different levels Different overheads and levels of accuracy with each technique (Figure: instrumentation levels – user-level abstractions / problem domain; source code via preprocessor or compiler instrumentation; object code and libraries via linker instrumentation; executable / runtime image, OS, and VM instrumentation; all producing performance data at run time.) No application instrumentation needed: run in a simulator. E.g., Valgrind, SIMICS, etc.
but simulation speed is an issue. Karl Fuerlinger CS267 - Performance Analysis Tools | 5 Instrumentation – Examples (1) Library Instrumentation: MPI library interposition – All MPI functions are available under two names: MPI_Xxx and PMPI_Xxx – The MPI_Xxx symbols are weak and can be overridden by an interposition library – Measurement code in the interposition library measures begin, end, transmitted data, etc., and calls the corresponding PMPI routine – Not all MPI functions need to be instrumented Karl Fuerlinger CS267 - Performance Analysis Tools | 6 Instrumentation – Examples (2) Preprocessor Instrumentation – Example: Instrumenting OpenMP constructs with Opari – Preprocessor operation: Original source code -> Preprocessor -> Modified (instrumented) source code – Example: Instrumentation of a parallel region (instrumentation added by Opari):
POMP_Parallel_fork [master]
#pragma omp parallel
{
  POMP_Parallel_begin [team]
  /* user code in parallel region */
  POMP_Barrier_enter [team]
  #pragma omp barrier
  POMP_Barrier_exit [team]
  POMP_Parallel_end [team]
}
POMP_Parallel_join [master]
This approach is used for OpenMP instrumentation by most vendor-independent tools. Examples: TAU/Kojak/Scalasca/ompP Karl Fuerlinger CS267 - Performance Analysis Tools | 7 Instrumentation – Examples (3) Source code instrumentation – User-added time measurement, etc. (e.g., printf(), gettimeofday()) – Think twice before you roll your own solution; many tools expose mechanisms for source code instrumentation in addition to automatic instrumentation mechanisms Instrument program phases: • Initialization • main loop iteration 1,2,3,4,...
• data post-processing – Pragma and pre-processor based: #pragma pomp inst begin(foo) // application code #pragma pomp inst end(foo) – Macro / function call based: ELG_USER_START("name"); // application code ELG_USER_END("name"); Karl Fuerlinger CS267 - Performance Analysis Tools | 8 Instrumentation – Examples (4) Compiler Instrumentation – Many compilers can instrument functions automatically – GNU compiler flag: -finstrument-functions – Automatically calls functions on function entry/exit that a tool can capture – Not standardized across compilers, often undocumented flags, sometimes not available at all – GNU compiler example:
void __cyg_profile_func_enter(void *this_fn, void *call_site) { /* called on function entry */ }
void __cyg_profile_func_exit(void *this_fn, void *call_site) { /* called just before returning from function */ }
Karl Fuerlinger CS267 - Performance Analysis Tools | 9 Instrumentation – Examples (5) Binary Runtime Instrumentation – Dynamic patching while the program executes – Example: Paradyn tool, Dyninst API Base trampolines / mini trampolines (figure by Skylar Byrd Rampersaud): – Base trampolines handle storing the current state of the program so instrumentations do not affect execution – Mini trampolines are the machine-specific realizations of predicates and primitives – One base trampoline may handle many mini-trampolines, but a base trampoline is needed for every instrumentation point Binary instrumentation is difficult – have to deal with: • Compiler optimizations • Branch delay slots • Different sizes of instructions for x86 (may increase the number of instructions that have to be relocated) • Creating and inserting mini trampolines somewhere in program (at end?) • Limited-range jumps may complicate this PIN: open-source dynamic binary instrumenter from Intel Karl Fuerlinger CS267 - Performance Analysis Tools | 10 Measurement Profiling vs.
Tracing Profiling – Summary statistics of performance metrics • Number of times a routine was invoked • Exclusive, inclusive time • Hardware performance counters • Number of child routines invoked, etc. • Structure of invocations (call-trees/call-graphs) • Memory, message communication sizes Tracing – When and where events took place along a global timeline • Time-stamped log of events • Message communication events (sends/receives) are tracked • Shows when and from/to where messages were sent • Large volume of performance data generated usually leads to more perturbation in the program Karl Fuerlinger CS267 - Performance Analysis Tools | 11 Measurement: Profiling Profiling – Helps to expose performance bottlenecks and hotspots – 80/20 rule or Pareto principle: often 80% of the execution time is spent in 20% of your application – Optimize what matters, don't waste time optimizing things that have negligible overall influence on performance Implementation – Sampling: periodic OS interrupts or hardware counter traps • Build a histogram of sampled program counter (PC) values • Hotspots will show up as regions with many hits – Measurement: direct insertion of measurement code • Measure at start and end of regions of interest, compute difference Karl Fuerlinger CS267 - Performance Analysis Tools | 12 Profiling: Inclusive vs. Exclusive Time
int main( ) {        /* takes 100 secs */
  f1();              /* takes 20 secs */
  /* other work */
  f2();              /* takes 50 secs */
  f1();              /* takes 20 secs */
  /* other work */
}
– Inclusive time for main: 100 secs – Exclusive time for main: 100-20-50-20 = 10 secs – Exclusive time sometimes called "self" time – Similar definitions for inclusive/exclusive time for f1() and f2() – Similar for other metrics, such as hardware performance counters, etc. Karl Fuerlinger CS267 - Performance Analysis Tools | 13 Tracing Example: Instrumentation, Monitor, Trace Event definitions Process A:
void master {
  trace(ENTER, 1);
  ...
  trace(SEND, B);
  send(B, tag, buf);
  ...
  trace(EXIT, 1);
}
Process B:
void worker {
  trace(ENTER, 2);
  ...
  recv(A, tag, buf);
  trace(RECV, A);
  ...
  trace(EXIT, 2);
}
Event definitions: 1 = master, 2 = worker, 3 = ... The MONITOR merges timestamped event records (timestamp, location, event, context):
58 A ENTER 1
60 B ENTER 2
62 A SEND B
64 A EXIT 1
68 B RECV A
69 B EXIT 2
Karl Fuerlinger CS267 - Performance Analysis Tools | 14 Tracing: Timeline Visualization The same trace (1 = master, 2 = worker) rendered as a timeline: processes A and B on the vertical axis, time 58-70 on the horizontal axis, with an arrow from A to B for the message. Karl Fuerlinger CS267 - Performance Analysis Tools | 15 Measurement: Tracing Tracing – Recording of information about significant points (events) during program execution • entering/exiting code region (function, loop, block, …) • thread/process interactions (e.g., send/receive message) – Save information in event record • timestamp • CPU identifier, thread identifier • Event type and event-specific information – Event trace is a time-sequenced stream of event records – Can be used to reconstruct dynamic program behavior – Typically requires code instrumentation Karl Fuerlinger CS267 - Performance Analysis Tools | 16 Performance Data Analysis Draw conclusions from measured performance data Manual analysis – Visualization – Interactive exploration – Statistical analysis – Modeling Automated analysis – Try to cope with huge amounts of performance data by automation – Examples: Paradyn, KOJAK, Scalasca, Periscope Karl Fuerlinger CS267 - Performance Analysis Tools | 17 Trace File Visualization Vampir: timeline view – Similar other tools: Jumpshot, Paraver Karl Fuerlinger CS267 - Performance Analysis Tools | 18 Trace File Visualization Vampir/IPM: message communication statistics Karl Fuerlinger CS267 - Performance Analysis Tools | 19 3D performance data exploration Paraprof viewer (from the TAU toolset) Karl Fuerlinger CS267 - Performance Analysis Tools | 20 Automated Performance Analysis Reason for Automation – Size of systems: tens of thousands of processors
– LLNL Sequoia: 1.6 million cores – Trend to multi-core Large amounts of performance data when tracing – Several gigabytes or even terabytes Not all programmers are performance experts – Scientists want to focus on their domain – Need to keep up with new machines Automation can solve some of these issues Karl Fuerlinger CS267 - Performance Analysis Tools | 21 Automation - Example "Late sender" pattern: one process sits in a blocking receive while the matching send on the other process starts late. This pattern can be detected automatically by analyzing the trace Karl Fuerlinger CS267 - Performance Analysis Tools | 22 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 23 Hardware Performance Counters Specialized hardware registers to measure the performance of various aspects of a microprocessor Originally used for hardware verification purposes Can provide insight into: – Cache behavior – Branching behavior – Memory and resource contention and access patterns – Pipeline stalls – Floating point efficiency – Instructions per cycle Counters vs. events – Usually a large number of countable events (several hundred) – On a small number of counters (4-18) – PAPI handles multiplexing if required Karl Fuerlinger CS267 - Performance Analysis Tools | 24 What is PAPI Middleware that provides a consistent and efficient programming interface to the performance counter hardware found in most major microprocessors. Countable events are defined in two ways: – Platform-neutral Preset Events (e.g., PAPI_TOT_INS) – Platform-dependent Native Events (e.g., L3_CACHE_MISS) Preset Events can be derived from multiple Native Events (e.g.
PAPI_L1_TCM might be the sum of L1 Data Misses and L1 Instruction Misses on a given platform) Preset events are defined in a best-effort way – No guarantee of portable semantics – Figuring out what a counter actually counts, and whether it does so correctly, can be hairy Karl Fuerlinger CS267 - Performance Analysis Tools | 25 PAPI Hardware Events Preset Events – Standard set of over 100 events for application performance tuning – No standardization of the exact definitions – Mapped to either single or linear combinations of native events on each platform – Use the papi_avail utility to see what preset events are available on a given platform Native Events – Any event countable by the CPU – Same interface as for preset events – Use the papi_native_avail utility to see all available native events – Use the papi_event_chooser utility to select a compatible set of events Karl Fuerlinger CS267 - Performance Analysis Tools | 26 PAPI Counter Interfaces PAPI provides 3 interfaces to the underlying counter hardware: – A low-level API manages hardware events (preset and native) in user-defined groups called EventSets. Meant for experienced application programmers wanting fine-grained measurements. – A high-level API provides the ability to start, stop and read the counters for a specified list of events (preset only). Meant for programmers wanting simple event measurements. – Graphical and end-user tools provide facile data collection and visualization.
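Presets and the high-level rate calls are, in the end, arithmetic over raw counts: a preset like PAPI_L1_TCM may be synthesized from two native events, and rates like IPC from two counts plus a time. A minimal sketch of that arithmetic (the helper names are hypothetical, for illustration only; they are not part of the PAPI API):

```c
/* Hypothetical helpers showing how PAPI-style derived metrics
   are computed from raw counter values.  Not PAPI code. */

/* A preset like PAPI_L1_TCM may be the sum of two native events. */
long long derived_l1_tcm(long long l1_data_misses, long long l1_inst_misses) {
    return l1_data_misses + l1_inst_misses;
}

/* Instructions per cycle, the metric PAPI_ipc() reports. */
double derived_ipc(long long instructions, long long cycles) {
    return cycles > 0 ? (double)instructions / (double)cycles : 0.0;
}

/* Mflop/s from a floating-point operation count and elapsed seconds,
   the metric PAPI_flops() reports. */
double derived_mflops(long long flpops, double seconds) {
    return seconds > 0.0 ? (double)flpops / (seconds * 1.0e6) : 0.0;
}
```

This also shows why preset semantics are "best effort": the quality of a derived value is only as good as the mapping from native events on the given platform.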
(Layer diagram: 3rd-party and GUI tools, the low-level user API, and the high-level user API sit on the PAPI portable layer; beneath it are the PAPI hardware-specific layer, a kernel extension, the operating system, and the performance counter hardware.) Karl Fuerlinger CS267 - Performance Analysis Tools | 27 PAPI High Level Calls
PAPI_num_counters() – get the number of hardware counters available on the system
PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips) – simplified call to get Mflips/s (floating point instruction rate), real and processor time
PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops) – simplified call to get Mflops/s (floating point operation rate), real and processor time
PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc) – gets instructions per cycle, real and processor time
PAPI_start_counters(int *events, int array_len) – start counting hardware events
PAPI_read_counters(long long *values, int array_len) – copy current counts to array and reset counters
PAPI_accum_counters(long long *values, int array_len) – add current counts to array and reset counters
PAPI_stop_counters(long long *values, int array_len) – stop counters and return current counts
Karl Fuerlinger CS267 - Performance Analysis Tools | 28 PAPI Example Low Level API Usage
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_FP_OPS, PAPI_TOT_CYC};
int EventSet = PAPI_NULL;
long long values[NUM_EVENTS];
int retval;
/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add FLOPs and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);
do_work(); /* What we want to monitor */
/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
Karl Fuerlinger CS267 - Performance Analysis Tools | 29 Using PAPI through tools You can use PAPI directly in your application, but most people use
it through tools Tool might have a predefined set of counters, or let you select counters through a configuration file, environment variable, etc. Tools using PAPI – TAU (UO) – PerfSuite (NCSA) – HPCToolkit (Rice) – KOJAK, Scalasca (FZ Juelich, UTK) – Open|Speedshop (SGI) – ompP (UCB) – IPM (LBNL) Karl Fuerlinger CS267 - Performance Analysis Tools | 30 Component PAPI Design – Re-implementation of PAPI with support for multiple monitoring domains (Layer diagram: low-level, high-level, and developer APIs on the PAPI framework layer; separate PAPI component layers for CPU, network, and thermal counters, each over its own kernel patch, operating system, and performance counter hardware.) Karl Fuerlinger CS267 - Performance Analysis Tools | 31 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 32 OpenMP Performance Analysis with ompP ompP: Profiling tool for OpenMP – Based on source code instrumentation – Independent of the compiler and runtime used – Tested and supported: Linux, Solaris, AIX and Intel, Pathscale, PGI, IBM, gcc, SUN studio compilers – Supports HW counters through PAPI – Uses source code instrumenter Opari from the KOJAK/Scalasca toolset – Available for download (GPL): http://www.ompp-tool.com Automatic instrumentation of OpenMP constructs, manual region instrumentation Source Code ompP library Settings (env.
Vars): HW counters, output format, … Executable -> Execution on parallel machine -> Profiling Report Karl Fuerlinger CS267 - Performance Analysis Tools | 33 OpenMP OpenMP – Threads and fork/join based programming model – Worksharing constructs (Figure: master thread forking teams of threads for parallel regions) Characteristics – Directive based (compiler pragmas, comments) – Incremental parallelization approach – Well suited for loop-based parallel programming – Less well suited for irregular parallelism (but tasking is included in version 3.0 of the OpenMP specification) – One of the contending programming paradigms for the "multicore era" Karl Fuerlinger CS267 - Performance Analysis Tools | 34 ompP's Profiling Report Header – Date, time, duration of the run, number of threads, used hardware counters, … Region Overview – Number of OpenMP regions (constructs) and their source-code locations Flat Region Profile – Inclusive times, counts, hardware counter data Callgraph Callgraph Profiles – With inclusive and exclusive times Overhead Analysis Report – Four overhead categories – Per-parallel region breakdown – Absolute times and percentages Karl Fuerlinger CS267 - Performance Analysis Tools | 35 Profiling Data Example profiling data Code:
#pragma omp parallel
{
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Profile:
R00002 main.c (34-37) (default) CRITICAL
TID   execT  execC  bodyT  enterT  exitT  PAPI_TOT_INS
  0    3.00      1   1.00    2.00   0.00          1595
  1    1.00      1   1.00    0.00   0.00          6347
  2    2.00      1   1.00    1.00   0.00          1595
  3    4.00      1   1.00    3.00   0.00          1595
SUM   10.01      4   4.00    6.00   0.00         11132
Components: – Source code location and type of region – Timing data and execution counts, depending on the particular construct – One line per thread, last line sums over all threads – Hardware counter data (if PAPI is available and HW counters are selected) – Data is "exact" (measured, not based on sampling) Karl Fuerlinger CS267 - Performance Analysis Tools | 36 Flat Region Profile (2) Times and counts reported by ompP for various OpenMP constructs Ends with
T: time Ends with C: count Main = enter + body + barr + exit Karl Fuerlinger CS267 - Performance Analysis Tools | 37 Callgraph Callgraph View – 'Callgraph' or 'region stack' of OpenMP constructs – Functions can be included by using Opari's mechanism to instrument user defined regions: #pragma pomp inst begin(…), #pragma pomp inst end(…) Callgraph profile – Similar to flat profile, but with inclusive/exclusive times Example:
main() {
  #pragma omp parallel
  {
    foo1();
    foo2();
  }
}
void foo1() { bar(); }
void foo2() { bar(); }
void bar() {
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Karl Fuerlinger CS267 - Performance Analysis Tools | 38 Callgraph (2) Callgraph display, incl. CPU time:
[APP 4 threads]                                   32.22 (100.0%)
+-R00004 main.c (42-46)               PARALLEL    32.06 (99.50%)
  |-R00001 main.c (19-21) ('foo1')    USERREG     10.02 (31.10%)
  | +-R00003 main.c (33-36) (unnamed) CRITICAL    10.02 (31.10%)
  +-R00002 main.c (26-28) ('foo2')    USERREG     16.03 (49.74%)
    +-R00003 main.c (33-36) (unnamed) CRITICAL    16.03 (49.74%)
Callgraph profiles:
[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
TID  execT/I  execT/E  execC
  0     1.00     0.00      1
  1     3.00     0.00      1
  2     2.00     0.00      1
  3     4.00     0.00      1
SUM    10.01     0.00      4
[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
[=03] R00003 main.c (33-36) (unnamed) CRITICAL
TID  execT  execC  bodyT/I  bodyT/E  enterT  exitT
  0   1.00      1     1.00     1.00    0.00   0.00
  1   3.00      1     1.00     1.00    2.00   0.00
  2   2.00      1     1.00     1.00    1.00   0.00
  3   4.00      1     1.00     1.00    3.00   0.00
SUM  10.01      4     4.00     4.00    6.00   0.00
Karl Fuerlinger CS267 - Performance Analysis Tools | 39 Overhead Analysis (1) Certain timing categories reported by ompP can be classified as overheads: – Example: exitBarT: time wasted by threads idling at the exit barrier of work-sharing constructs.
Reason is most likely an imbalanced amount of work Four overhead categories are defined in ompP: – Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region – Synchronization: overhead that arises due to threads having to synchronize their activity, e.g. a barrier call – Limited Parallelism: idle threads due to not enough parallelism being exposed by the program – Thread management: overhead for the creation and destruction of threads, and for signaling critical sections, locks as available Karl Fuerlinger CS267 - Performance Analysis Tools | 40 Overhead Analysis (2) S: Synchronization overhead I: Imbalance overhead M: Thread management overhead L: Limited Parallelism overhead Karl Fuerlinger CS267 - Performance Analysis Tools | 41 ompP's Overhead Analysis Report
-------------------------------------------------------------------------
ompP Overhead Analysis Report
-------------------------------------------------------------------------
Total runtime (wallclock)  : 172.64 sec [32 threads]
Number of parallel regions : 12
Parallel coverage          : 134.83 sec (78.10%)

Parallel regions sorted by wallclock time:
        Type    Location            Wallclock (%)
R00011  PARALL  mgrid.F (360-384)   55.75 (32.29)
R00019  PARALL  mgrid.F (403-427)   23.02 (13.34)
R00009  PARALL  mgrid.F (204-217)   11.94 ( 6.92)
...
SUM                                134.83 (78.10)

Overheads wrt. each individual parallel region (Total = wallclock time x number of threads; overhead percentages wrt. this particular region):
          Total    Ovhds (%)       = Synch (%)   + Imbal (%)      + Limpar (%)   + Mgmt (%)
R00011  1783.95  337.26 (18.91)  0.00 ( 0.00)  305.75 (17.14)  0.00 ( 0.00)   31.51 ( 1.77)
R00019   736.80  129.95 (17.64)  0.00 ( 0.00)  104.28 (14.15)  0.00 ( 0.00)   25.66 ( 3.48)
R00009   382.15  183.14 (47.92)  0.00 ( 0.00)   96.47 (25.24)  0.00 ( 0.00)   86.67 (22.68)
R00015   276.11   68.85 (24.94)  0.00 ( 0.00)   51.15 (18.52)  0.00 ( 0.00)   17.70 ( 6.41)
...

Overheads wrt. whole program (overhead percentages wrt. whole program):
          Total    Ovhds (%)       = Synch (%)   + Imbal (%)      + Limpar (%)   + Mgmt (%)
R00011  1783.95  337.26 ( 6.10)  0.00 ( 0.00)  305.75 ( 5.53)  0.00 ( 0.00)   31.51 ( 0.57)
R00009   382.15  183.14 ( 3.32)  0.00 ( 0.00)   96.47 ( 1.75)  0.00 ( 0.00)   86.67 ( 1.57)
R00005   264.16  164.90 ( 2.98)  0.00 ( 0.00)   63.92 ( 1.16)  0.00 ( 0.00)  100.98 ( 1.83)
R00007   230.63  151.91 ( 2.75)  0.00 ( 0.00)   68.58 ( 1.24)  0.00 ( 0.00)   83.33 ( 1.51)
...
SUM     4314.62 1277.89 (23.13)  0.00 ( 0.00)  872.92 (15.80)  0.00 ( 0.00)  404.97 ( 7.33)
Karl Fuerlinger CS267 - Performance Analysis Tools | 42 OpenMP Scalability Analysis Methodology – Classify execution time into "Work" and four overhead categories: "Thread Management", "Limited Parallelism", "Imbalance", "Synchronization" – Analyze how overheads behave for increasing thread counts – Graphs show accumulated runtime over all threads for fixed workload (strong scaling) – Horizontal line = perfect (linear) scalability (Figure: accumulated time and wallclock time vs. thread count for imperfect, perfect (linear), and super-linear scaling.) Karl Fuerlinger CS267 - Performance Analysis Tools | 43 SPEC OpenMP Benchmarks (1) Application 314.mgrid_m – Scales relatively poorly; the application has 12 parallel loops, all contribute with increasingly severe load imbalance – Markedly smaller load imbalance for thread counts of 32 and 16.
Only three loops show this behavior – In all three cases the iteration count is always a power of two (2 to 256); hence thread counts which are not a power of two exhibit more load imbalance Karl Fuerlinger CS267 - Performance Analysis Tools | 44 SPEC OpenMP Benchmarks (2) Application 316.applu – Super-linear speedup – Only one parallel region (ssor.f 138-209) shows super-linear speedup, contributes 80% of accumulated total execution time – Most likely reason for super-linear speedup: increased overall cache size (Chart: L3_MISSES vs. number of threads, 2 to 32.) Karl Fuerlinger CS267 - Performance Analysis Tools | 45 SPEC OpenMP Benchmarks (3) Application 313.swim – Dominating source of inefficiency is thread management overhead – Main source: reduction of three scalar variables in a small parallel loop in swim.f 116-126 – At 128 threads more than 6 percent of the total accumulated runtime is spent in the reduction operation – Time for the reduction operation is larger than time spent in the body of the parallel region Karl Fuerlinger CS267 - Performance Analysis Tools | 46 SPEC OpenMP Benchmarks (4) Application 318.galgel – Scales very poorly; a large fraction of the overhead is not accounted for by ompP (most likely memory access latency, cache conflicts, false sharing) – lapack.f90 5081-5092 contributes significantly to the bad scaling • accumulated CPU time increases from 107.9 (2 threads) to 1349.1 seconds (32 threads) • the 32-thread version is only 22% faster than the 2-thread version (wall-clock time) • the 32-thread version's parallel efficiency is only approx. 8% (Charts: whole application vs. region lapack.f90 5081-5092.) Karl Fuerlinger CS267 - Performance Analysis Tools | 47 Incremental Profiling (1) Profiling vs.
Tracing – Profiling: • • • – Tracing: • • • • low overhead small amounts of data easy to comprehend, even as simple ASCII text Large quantities of data hard to comprehend manually allows temporal phenomena to be explained causal relationship of events are preserved Idea: Combine advantages of profiling and tracing – Add a temporal dimension to profiling-type performance data – See what happens during the execution without capturing full traces – Manual interpretation becomes harder since a new dimension is added to the performance data Karl Fuerlinger CS267 - Performance Analysis Tools | 48 Incremental Profiling (2) Implementation: – Capture and dump profiling reports not only at the end of the execution but several times while the application executes – Analyze how profiling reports change over time – Capture points need not be regular “One-shot” Profiling time Incremental Profiling Karl Fuerlinger CS267 - Performance Analysis Tools | 49 Incremental Profiling (3) Possible triggers for capturing profiles: – Timer-based, fixed: capture profiles in regular, uniform intervals: predictable storage requirements (depends only on duration of program run, size of dataset). – Timer-based, adaptive: Adapt the capture rate to the behavior of the application: dump often if application behavior changes, decrease rate if application behavior stays the same – Counter overflow based: Dump a profile if a hardware counter overflows. 
Interesting for floating point intensive applications – User-added: expose an API for dumping profiles to the user, aligned to outer loop iterations or phase boundaries Karl Fuerlinger CS267 - Performance Analysis Tools | 50 Incremental Profiling Trigger currently implemented in ompP: – Capture profiles in regular intervals – Timer signal is registered and delivered to profiler – Profiling data up to capture point stored to memory buffer – Dumped as individual profiling reports at the end of program execution – Perl scripts to analyze reports and generate graphs Experiments – 1 second regular dump interval – SPEC OpenMP benchmark suite • Medium variant, 11 applications – 32 CPU SGI Altix machine • Itanium-2 processors with 1.6 GHz and 6 MB L3 cache • Used in batch mode Karl Fuerlinger CS267 - Performance Analysis Tools | 51 Incremental Profiling: Data Views (2) Overheads over time – See how overheads change over the application run – How is each Δt (1 sec) spent: on work or on one of the overhead classes? – Either for the whole program or for a specific parallel region – Total incurred overhead = integral under this function Application: 328.fma3d_m – Initialization in a critical section, effectively serializing the execution for approx. 15 seconds.
Overhead = 31/32 ≈ 96.9% Karl Fuerlinger CS267 - Performance Analysis Tools | 52 Incremental Profiling Performance counter heatmaps – x-axis: time, y-axis: thread ID – Color: number of hardware counter events observed during the sampling period – Application "applu", medium-sized variant, counter: LOADS_RETIRED – Visible phenomena: iterative behavior, thread grouping (pairs) Karl Fuerlinger CS267 - Performance Analysis Tools | 53 Outline Motivation Concepts and definitions – Instrumentation, monitoring, analysis Some tools and their functionality – PAPI – access to hardware performance counters – ompP – profiling OpenMP code – IPM – monitoring message passing applications (Backup Slides) – Vampir – Kojak/Scalasca – TAU Karl Fuerlinger CS267 - Performance Analysis Tools | 54 IPM – Integrated Performance Monitoring IPM provides a performance profile of a job – "Flip of a switch" operation – http://ipm-hpc.sourceforge.net (Figure: Input_123 + Job_123 -> IPM -> Output_123 + Profile_123.) Karl Fuerlinger CS267 - Performance Analysis Tools | 55 IPM: Design Goals Provide high-level performance profile – Event inventory: which events happened and how much time did they take – How much time in communication operations, how much time in OpenMP parallel regions, how much time in file I/O – Less focus on drill-down into the application than other tools Efficiency – Constant memory footprint (approx. 1-2 MB per MPI rank) – Monitoring data is kept in a hash table, avoids dynamic memory allocation – Low CPU overhead: 1-2% Ease of use – HTML- or ASCII-based output format – Flip of a switch, no recompilation, no user instrumentation – Portability Karl Fuerlinger CS267 - Performance Analysis Tools | 56 IPM: Methodology MPI_Init() – Initialize monitoring environment, allocate memory For each MPI call – Compute hash key from • Type of call (send/recv/bcast/...) • Buffer size (in bytes) • Communication partner rank • Call-site, region or phase identifier, ...
– Store / update value in hash table with timing data • Number of invocations • Minimum duration, maximum duration, summed time MPI_Finalize() – Aggregate, report to stdout, write XML log Karl Fuerlinger CS267 - Performance Analysis Tools | 57 Using IPM: Basics Do "module load ipm", then run normally Upon completion you get
##IPMv0.85################################################################
#
# command : ../exe/pmemd -O -c inpcrd -o res (completed)
# host    : s05405             mpi_tasks : 64 on 4 nodes
# start   : 02/22/05/10:03:55  wallclock : 24.278400 sec
# stop    : 02/22/05/10:04:17  %comm     : 32.43
# gbytes  : 2.57604e+00 total  gflop/sec : 2.04615e+00 total
#
###########################################################################
Maybe that's enough. If so you're done. Have a nice day. Karl Fuerlinger CS267 - Performance Analysis Tools | 58 Want more detail? IPM_REPORT=full
##IPMv0.85#####################################################################
#
# command : ../exe/pmemd -O -c inpcrd -o res (completed)
# host    : s05405             mpi_tasks : 64 on 4 nodes
# start   : 02/22/05/10:03:55  wallclock : 24.278400 sec
# stop    : 02/22/05/10:04:17  %comm     : 32.43
# gbytes  : 2.57604e+00 total  gflop/sec : 2.04615e+00 total
#
#             [total]      <avg>        min          max
# wallclock   1373.67      21.4636      21.1087      24.2784
# user        936.95       14.6398      12.68        20.3
# system      227.7        3.55781      1.51         5
# mpi         503.853      7.8727       4.2293       9.13725
# %comm                    32.4268      17.42        41.407
# gflop/sec   2.04614      0.0319709    0.02724      0.04041
# gbytes      2.57604      0.0402507    0.0399284    0.0408173
# gbytes_tx   0.665125     0.0103926    1.09673e-05  0.0368981
# gbyte_rx    0.659763     0.0103088    9.83477e-07  0.0417372
#
Karl Fuerlinger CS267 - Performance Analysis Tools | 59 Want more detail?
IPM_REPORT=full

# PM_CYC       3.00519e+11 4.69561e+09 4.50223e+09 5.83342e+09
# PM_FPU0_CMPL 2.45263e+10 3.83223e+08  3.3396e+08 5.12702e+08
# PM_FPU1_CMPL 1.48426e+10 2.31916e+08 1.90704e+08  2.8053e+08
# PM_FPU_FMA   1.03083e+10 1.61067e+08 1.36815e+08 1.96841e+08
# PM_INST_CMPL 3.33597e+11 5.21245e+09 4.33725e+09 6.44214e+09
# PM_LD_CMPL   1.03239e+11 1.61311e+09 1.29033e+09 1.84128e+09
# PM_ST_CMPL   7.19365e+10 1.12401e+09 8.77684e+08 1.29017e+09
# PM_TLB_MISS  1.67892e+08 2.62332e+06 1.16104e+06 2.36664e+07
#
#                   [time]     [calls]      <%mpi>     <%wall>
# MPI_Bcast        352.365        2816       69.93       22.68
# MPI_Waitany      81.0002      185729       16.08        5.21
# MPI_Allreduce    38.6718        5184        7.68        2.49
# MPI_Allgatherv   14.7468         448        2.93        0.95
# MPI_Isend        12.9071      185729        2.56        0.83
# MPI_Gatherv      2.06443         128        0.41        0.13
# MPI_Irecv          1.349      185729        0.27        0.09
# MPI_Waitall     0.606749        8064        0.12        0.04
# MPI_Gather     0.0942596         192        0.02        0.01
###############################################################################

IPM: XML log files
There’s a lot more information in the logfile than you get on stdout. A logfile is written that contains the hash table, switch traffic, memory usage, executable information, ...
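The percentage columns in the report above can be reproduced from the totals it prints. A small sketch, assuming (consistently with the numbers shown) that <%mpi> is a call's summed time over the total MPI time and <%wall> is that time over wallclock × ntasks:

```python
# Reproduce IPM's <%mpi> and <%wall> columns from the report's totals.
mpi_total = 503.853    # total MPI time over all ranks (from the report)
wallclock = 24.278400  # wallclock in seconds (from the report header)
ntasks    = 64         # mpi_tasks (from the report header)

def pct_mpi(t):
    return round(100 * t / mpi_total, 2)

def pct_wall(t):
    return round(100 * t / (wallclock * ntasks), 2)

# MPI_Bcast: 352.365 s summed across ranks
print(pct_mpi(352.365), pct_wall(352.365))  # 69.93 22.68, as in the report
```

The same formulas recover the MPI_Waitany row (16.08 and 5.21), which suggests this reading of the columns is the intended one.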
Parallelism in writing the log (when possible)
The IPM logs are durable performance profiles serving
– HPC center production needs: https://www.nersc.gov/nusers/status/llsum/ and http://www.sdsc.edu/user_services/top/ipm/
– HPC research: ipm_parse renders txt and html, http://www.nersc.gov/projects/ipm/ex3/
– Your own XML-consuming entity, feed, or process

Message Sizes: CAM, 336-way
– per MPI call
– per MPI call & buffer size

Scalability: Required
32K-task AMR code

More than a pretty picture

Application Assessment with IPM
Provide high-level performance numbers with small overhead
– To get an initial read on application runtimes
– For allocation/reporting
– To check the performance weather on systems with high variability
What’s going on overall in my code?
– How much comp, comm, I/O?
– Where to start with optimization?
How is my load balance?
– Domain decomposition vs. concurrency (M work on N tasks)

When to reach for another tool
– Full application tracing
– Looking for hotspots at the statement level in code
– Data-structure-level detail
– Automated performance feedback

What’s wrong here?

Is MPI_Barrier time bad? Probably. Is it avoidable?
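Feeding the XML log into "your own XML-consuming process" can be as simple as the sketch below. Note the tag and attribute names here are invented for illustration only and do not match IPM's actual schema; the point is the pattern of pulling per-function totals out of a structured log.

```python
# Hypothetical consumer of an IPM-style XML log (schema invented here).
import xml.etree.ElementTree as ET

log = """<task mpi_tasks="64">
  <func name="MPI_Bcast"   count="2816"   time="352.365"/>
  <func name="MPI_Waitany" count="185729" time="81.0002"/>
</task>"""

root = ET.fromstring(log)
# Map function name -> summed time, then find the dominant call.
times = {f.get("name"): float(f.get("time")) for f in root.iter("func")}
top = max(times, key=times.get)
print(top, times[top])  # MPI_Bcast 352.365
```

For real logs you would use ipm_parse first, or inspect the file to learn the actual element names before writing such a consumer.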
– The stray / unknown / debug barrier
– Barriers used for I/O ordering

Summary
Performance monitoring concepts
– Instrument, measure, analyze
– Profiling/tracing, sampling, direct measurement
Tools
– PAPI, ompP, and IPM as examples
Lots of other tools
– Vendor tools: Cray PAT, Sun Studio, Intel Thread Profiler, VTune, PTU, ...
– Portable tools: TAU, PerfSuite, Paradyn, HPCToolkit, Kojak, Scalasca, Vampir, oprofile, gprof, ...

Documentation, Manuals, User Guides
– PAPI: http://icl.cs.utk.edu/papi/
– ompP: http://www.ompp-tool.com
– IPM: http://ipm-hpc.sourceforge.net/
– TAU: http://www.cs.uoregon.edu/research/tau/
– VAMPIR: http://www.vampir-ng.de/
– Scalasca: http://www.scalasca.org

Thank you for your attention!

BACKUP SLIDES

Vampir – Trace Visualization

Vampir overview statistics
Aggregated profiling information
– Execution time
– Number of calls
This profiling information is computed from the trace
– Change the selection in the main timeline window
– Inclusive or exclusive of called routines

Timeline display
To zoom, mark a region with the mouse

Timeline display – message details
– Message information: click on a message line
– Message send op / message receive op

Communication statistics
Message statistics for each process/node pair:
– Byte and message count
– min/max/avg message length, bandwidth

Message histograms
Message statistics by length, tag, or communicator
– Byte and message count
– Min/max/avg bandwidth

Collective operations
For
each process: mark the operation locally
– Start of op / stop of op
– Data being sent / data being received
– Connect start/stop points by connection lines

Activity chart
Profiling information for all processes

Process-local displays
– Timeline (showing calling levels)
– Activity chart
– Calling tree (showing number of calls)

Effects of zooming
Select one iteration
– Updated message statistics
– Updated summary

KOJAK / Scalasca

Basic Idea
“Traditional” tool:
– Huge amount of measurement data
– For non-standard / tricky cases (10%)
– For expert users
Automatic tool:
– Simple: 1 screen + 2 commands + 3 panes
– Relevant problems and data
– For standard cases (90%?!)
– For “normal” users; starting point for experts
More productivity for the performance analysis process!

MPI-1 Pattern: Wait at Barrier
Time spent in front of an MPI synchronizing operation such as a barrier

MPI-1 Pattern: Late Sender / Receiver
[Timeline diagrams (location vs. time) of MPI_Send/MPI_Recv and MPI_Send/MPI_Irecv/MPI_Wait pairs]
Late Sender: time lost waiting, caused by a blocking receive operation posted earlier than the corresponding send operation
Late Receiver: time lost waiting in a blocking send operation until the corresponding receive operation is called

The result display answers:
– Performance property: what problem?
– Region tree: where in the source code? In what context?
– Color coding: how severe is the problem?
– Location: how is the problem distributed across the machine?
KOJAK: sPPM run on 1792 (8x16x14) PEs
Topology display
– Shows the distribution of a pattern over the HW topology
– Easily scales to even larger systems

TAU

TAU Parallel Performance System
http://www.cs.uoregon.edu/research/tau/
Multi-level performance instrumentation
– Multi-language automatic source instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
– Computer system architectures and operating systems
– Different programming languages and compilers
Support for multiple parallel programming paradigms
– Multi-threading, message passing, mixed-mode, hybrid
Integration in complex software, systems, applications

ParaProf – 3D Scatterplot (Miranda)
– Each point is a “thread” of execution
– A total of four metrics shown in relation
– ParaVis 3D profile visualization library (JOGL)
– 32k processors

ParaProf – 3D Scatterplot (SWEEP3D CUBE)

PerfExplorer – Cluster Analysis
– Four significant events automatically selected (from 16K processors)
– Clusters and correlations are visible

PerfExplorer – Correlation Analysis (Flash)
Describes the strength and direction of a linear relationship between two variables (events) in the data

PerfExplorer – Correlation Analysis (Flash)
–0.995 indicates a strong, negative relationship: as CALC_CUT_BLOCK_CONTRIBUTIONS() increases in execution time, MPI_Barrier() decreases
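The correlation coefficient PerfExplorer reports is the ordinary Pearson r. A short sketch with made-up per-rank timings (the data below is hypothetical, chosen only to mimic the slide's near −1 result, where barrier time shrinks as compute time grows):

```python
# Pearson correlation between two events' per-rank execution times.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

calc_time    = [10.0, 12.0, 15.0, 18.0, 20.0]  # hypothetical compute times
barrier_time = [9.9,  8.1,  5.2,  2.0,  0.1]   # shrinks as compute grows

r = pearson(calc_time, barrier_time)
print(round(r, 3))  # close to -1: strong negative linear relationship
```

Ranks that spend longer computing spend correspondingly less time in the barrier, which is exactly the load-imbalance signature the Flash slide describes.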