Debugging and Optimization Tools Richard Gerber NERSC User Services David Skinner NERSC Outreach, Software & Programming Group UCB CS267 February 15, 2011


Debugging and Optimization Tools

Richard Gerber

NERSC User Services

David Skinner

NERSC Outreach, Software & Programming Group

UCB CS267

February 15, 2011

Outline

• Introduction
• Debugging
• Performance / Optimization

See slides and videos from NERSC Hopper Training:
http://newweb.nersc.gov/for-users/training-and-tutorials/courses/CS267/
(newweb -> www sometime soon)

Introduction

• Scope of Today’s Talks
  – Debugging and optimization tools (R. Gerber)
  – Some basic strategies for parallel performance (D. Skinner)
• Take Aways
  – Common problems to look out for
  – How tools work in general
  – A few specific tools you can try
  – Where to get more information

Debugging

Debugging

Typical problems

– “Serial”
  • Invalid memory references
  • Array reference out of bounds
  • Divide by zero
  • Uninitialized variables
– Parallel
  • Unmatched sends/receives
  • Blocking receive before corresponding send
  • Out of order collectives
  • Race conditions
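To make the "blocking receive before corresponding send" problem concrete, here is a minimal sketch in Python. It is only an analogy: threads and queues stand in for MPI ranks and messages, and the names `exchange` and `run_pair` are hypothetical helpers, not part of any MPI library. A timeout is used so that the deadlock shows up as a `None` result instead of hanging the program.

```python
import queue
import threading

def exchange(my_inbox, peer_inbox, value, recv_first, timeout=0.5):
    """Swap one value with a peer. Returns the received value, or None on timeout."""
    try:
        if recv_first:
            # Blocking receive posted before our own send: if the peer does
            # the same thing, neither side ever sends and both wait forever.
            got = my_inbox.get(timeout=timeout)
            peer_inbox.put(value)
        else:
            peer_inbox.put(value)                 # send first ...
            got = my_inbox.get(timeout=timeout)   # ... then receive
        return got
    except queue.Empty:
        return None  # the timeout stands in for a hang

def run_pair(recv_first_a, recv_first_b):
    """Run two 'ranks' (threads a and b) and report what each one received."""
    inbox_a, inbox_b = queue.Queue(), queue.Queue()
    results = {}

    def worker(name, mine, peer, value, recv_first):
        results[name] = exchange(mine, peer, value, recv_first)

    ta = threading.Thread(target=worker, args=("a", inbox_a, inbox_b, 1, recv_first_a))
    tb = threading.Thread(target=worker, args=("b", inbox_b, inbox_a, 2, recv_first_b))
    ta.start()
    tb.start()
    ta.join()
    tb.join()
    return results
```

Calling `run_pair(True, True)` has both sides receive first, so both time out. Calling `run_pair(False, True)` breaks the cycle because one side sends before it receives.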

Tools

• printf, write
  – Versatile, sometimes useful
  – Doesn’t scale well
  – Not interactive
• Compiler / runtime
  – Turn on bounds checking, exception handling
  – Check dereferencing of NULL pointers
• Serial gdb
  – GNU debugger, serial, command-line interface
  – See “man gdb”
• Parallel GUI debuggers (X-Windows)
  – DDT
  – Totalview

DDT video

http://vimeo.com/19978486

Or http://vimeo.com/user5729706


Compiler runtime bounds checking

Out of bounds reference in source code for program “flip”

    ftn -c -g -Ktrap=fp -Mbounds flip.f90
    ftn -c -g -Ktrap=fp -Mbounds printit.f90
    ftn -o flip flip.o printit.o -g

    …
    allocate(put_seed(random_size))
    …
    bad_index = random_size+1
    put_seed(bad_index) = 67

    % qsub -I -qdebug -lmppwidth=48
    % cd $PBS_O_WORKDIR
    % aprun -n 48 ./flip
    0: Subscript out of range for array put_seed (flip.f90: 50)
        subscript=35, lower bound=1, upper bound=34, dimension=1
    0: Subscript out of range for array put_seed (flip.f90: 50)
        subscript=35, lower bound=1, upper bound=34, dimension=1


Performance / Optimization

Performance Questions

• How can we tell if a program is performing well? Or isn’t?
• If performance is not “good,” how can we pinpoint why? How can we identify the causes?
• What can we do about it?

Performance Metrics

• Primary metric: application time
  – but gives little indication of efficiency
• Derived measures:
  – rate (Ex.: messages per unit time, Flops per second, clocks per instruction), cache utilization
• Indirect measures:
  – speedup, parallel efficiency, scalability
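The indirect measures can be computed directly from wall-clock timings. The sketch below uses the common textbook definitions (speedup = serial time / parallel time; parallel efficiency = speedup / number of processes), which the slides do not spell out. The function names and the example timings are illustrative assumptions, not measurements.

```python
def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: speedup divided by the number of processes (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_procs

# Hypothetical timings: 100 s on 1 process, 20 s on 8 processes.
example_speedup = speedup(100.0, 20.0)                    # 5.0
example_efficiency = parallel_efficiency(100.0, 20.0, 8)  # 0.625
```

Repeating the calculation over a range of process counts gives a simple picture of scalability: efficiency that falls quickly as processes are added suggests communication or load imbalance is starting to dominate.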


Optimization Strategies

• Serial
  – Leverage ILP on the processor
  – Feed the pipelines
  – Exploit data locality
  – Reuse data in cache
• Parallel
  – Minimizing latency effects
  – Maximizing work vs. communication

Identifying Targets for Optimization

• Sampling
  – Regularly interrupt the program and record where it is
  – Build up a statistical profile
• Tracing / Instrumenting
  – Insert hooks into program to record and time events
• Use Hardware Event Counters
  – Special registers count events on processor
  – E.g. floating point instructions
  – Many possible events
  – Only a few (~4 counters)
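The sampling idea can be illustrated without a real profiler. The sketch below simulates it: a program's execution is described as a hypothetical timeline of (routine, duration) pairs, the "program" is interrupted at regular intervals, and the routine active at each interrupt is tallied. The names `sample_profile` and `as_fractions` and the timeline values are made up for illustration; real tools sample the program counter of a running process instead.

```python
from collections import Counter

def sample_profile(timeline, interval):
    """Tally which routine is active at each regularly spaced sample point.

    timeline: list of (routine_name, duration_seconds), executed in order.
    interval: time between samples, in seconds.
    """
    total = sum(duration for _, duration in timeline)
    n_samples = int(total / interval)
    counts = Counter()
    for k in range(n_samples):
        t = (k + 0.5) * interval          # sample in the middle of each interval
        elapsed = 0.0
        for name, duration in timeline:
            elapsed += duration
            if t < elapsed:               # this routine is running at time t
                counts[name] += 1
                break
    return counts

def as_fractions(counts):
    """Convert sample counts into the estimated fraction of run time per routine."""
    n = sum(counts.values())
    return {name: c / n for name, c in counts.items()}

# Hypothetical run: 1 s of setup, 7 s in the solver, 2 s of output.
profile = sample_profile([("init", 1.0), ("solve", 7.0), ("io", 2.0)], 0.5)
```

With these inputs the 20 samples split 2 / 14 / 4, so the statistical profile recovers the 10% / 70% / 20% time split without timing any individual routine.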

Typical Process

• (Sometimes) Modify your code with macros, API calls, timers
• Compile your code
• Transform your binary for profiling/tracing with a tool
• Run the transformed binary
  – A data file is produced
• Interpret the results with a tool

Performance Tools @ NERSC

• Vendor Tools:
  – CrayPat
• Community Tools:
  – TAU (U. Oregon via ACTS)
  – PAPI (Performance Application Programming Interface)
  – gprof
• IPM: Integrated Performance Monitoring

Introduction to CrayPat

• Suite of tools to provide a wide range of performance-related information
• Can be used for both sampling and tracing user codes
  – with or without hardware or network performance counters
  – Built on PAPI
• Supports Fortran, C, C++, UPC, MPI, Coarray Fortran, OpenMP, Pthreads, SHMEM
• Man pages
  – intro_craypat(1), intro_app2(1), intro_papi(1)

Using CrayPat

1. Access the tools
   – module load perftools
2. Build your application; keep .o files
   – make clean
   – make
3. Instrument application
   – pat_build ... a.out
   – Result is a new file, a.out+pat
4. Run instrumented application to get top time consuming routines
   – aprun ... a.out+pat
   – Result is a new file XXXXX.xf (or a directory containing .xf files)
5. Run pat_report on that new file; view results
   – pat_report XXXXX.xf > my_profile
   – vi my_profile
   – Result is also a new file: XXXXX.ap2

Guidelines to Identify the Need for Optimization

Derived metric                                 Optimization needed when*   PAT_RT_HWPC
Computational intensity                        < 0.5 ops/ref               0, 1
L1 cache hit ratio                             < 90%                       0, 1, 2
L1 cache utilization (misses)                  < 1 avg hit                 0, 1, 2
L1+L2 cache hit ratio                          < 92%                       2
L1+L2 cache utilization (misses)               < 1 avg hit                 2
TLB utilization                                < 0.9 avg use               1
(FP Multiply / FP Ops) or (FP Add / FP Ops)    < 25%                       5
Vectorization                                  < 1.5 for dp; 3 for sp      12 (13, 14)

* Suggested by Cray
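As a rough illustration of how two of these guidelines might be applied, the sketch below derives computational intensity and L1 cache hit ratio from raw counts and compares them with the 0.5 ops/ref and 90% thresholds listed above. The function name, its parameters, and the example counts are hypothetical. It also assumes that "ops/ref" means floating point operations per memory reference and that the hit ratio is hits divided by accesses; CrayPat reports these derived metrics itself, so this only shows the arithmetic.

```python
def check_guidelines(fp_ops, mem_refs, l1_hits, l1_accesses):
    """Return (intensity, l1_hit_pct, flags) for two of the suggested guidelines.

    Assumed definitions (not taken from CrayPat documentation):
      computational intensity = floating point ops / memory references
      L1 cache hit ratio      = 100 * L1 hits / L1 accesses
    """
    intensity = fp_ops / mem_refs
    l1_hit_pct = 100.0 * l1_hits / l1_accesses
    flags = []
    if intensity < 0.5:          # threshold from the table: < 0.5 ops/ref
        flags.append("computational intensity")
    if l1_hit_pct < 90.0:        # threshold from the table: < 90%
        flags.append("L1 cache hit ratio")
    return intensity, l1_hit_pct, flags

# Hypothetical counter values for one run.
intensity, l1_hit_pct, flags = check_guidelines(
    fp_ops=4.0e9, mem_refs=1.0e10, l1_hits=9.5e9, l1_accesses=1.0e10
)
```

Here the intensity works out to about 0.4 ops/ref, which is under the threshold, while the 95% L1 hit ratio is not, so only the first metric is flagged.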


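Apprentice Basic View

[Screenshot of the Apprentice basic view; image not reproduced. Callouts on the screenshot:]

• Can select new (additional) data file and do a screen dump
• Can select other views of the data
• Can drag the “calipers” to focus the view on portions of the run
• Two elements of the view are labeled “Worthless” and “Useful”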

PAPI

• PAPI (Performance API) provides a standard interface for use of the performance counters in major microprocessors
• Predefined actual and derived counters supported on the system
  – To see the list, run ‘papi_avail’ on compute node via aprun:
      module load perftools
      aprun -n 1 papi_avail
• AMD native events also provided; use ‘papi_native_avail’:
      aprun -n 1 papi_native_avail

TAU

• Tuning and Analysis Utilities
• Fortran, C, C++, Java performance tool
• Procedure
  – Insert macros
  – Run the program
  – View results with pprof
• More info than gprof
  – E.g. per process, per thread info; supports pthreads
• http://acts.nersc.gov/tau/index.html

IPM

• Integrated Performance Monitoring
• MPI profiling, hardware counter metrics, IO profiling (?)
• IPM requires no code modification & no instrumented binary
  – Only a “module load ipm” before running your program on systems that support dynamic libraries
  – Else link with the IPM library
• IPM uses hooks already in the MPI library to intercept your MPI calls and wrap them with timers and counters
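The wrap-with-timers-and-counters idea can be sketched in a few lines of Python. This is not how IPM is implemented (IPM intercepts real MPI calls inside the MPI library); it is only an analogy in which a decorator plays the role of the interception layer. The names `stats`, `profiled`, and `fake_send` are hypothetical.

```python
import functools
import time
from collections import defaultdict

# Per-function totals, loosely analogous to the [time] and [calls] columns
# of a profiling report.
stats = defaultdict(lambda: {"calls": 0, "time": 0.0})

def profiled(func):
    """Wrap func so that every call is counted and timed, with no change to func itself."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)   # forward to the real routine
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["time"] += time.perf_counter() - start
    return wrapper

@profiled
def fake_send(payload):
    """Hypothetical stand-in for a communication call; returns the bytes 'sent'."""
    return len(payload)
```

The caller still just calls `fake_send(...)` and gets the same result as before. The bookkeeping accumulates in `stats` as a side effect, which is the reason this style of profiling needs no source changes in the application.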

IPM

# host      : s05601/006035314C00_AIX    mpi_tasks : 32 on 2 nodes
# start     : 11/30/04/14:35:34          wallclock : 29.975184 sec
# stop      : 11/30/04/14:36:00          %comm     : 27.72
# gbytes    : 6.65863e-01 total          gflop/sec : 2.33478e+00 total
#
#                  [total]       <avg>         min           max
# wallclock        953.272       29.7897       29.6092       29.9752
# user             837.25        26.1641       25.71         26.92
# system           60.6          1.89375       1.52          2.59
# mpi              264.267       8.25834       7.73025       8.70985
# %comm                          27.7234       25.8873       29.3705
# gflop/sec        2.33478       0.0729619     0.072204      0.0745817
# gbytes           0.665863      0.0208082     0.0195503     0.0237541
#
# PM_FPU0_CMPL     2.28827e+10   7.15084e+08   7.07373e+08   7.30171e+08
# PM_FPU1_CMPL     1.70657e+10   5.33304e+08   5.28487e+08   5.42882e+08
# PM_FPU_FMA       3.00371e+10   9.3866e+08    9.27762e+08   9.62547e+08
# PM_INST_CMPL     2.78819e+11   8.71309e+09   8.20981e+09   9.21761e+09
# PM_LD_CMPL       1.25478e+11   3.92118e+09   3.74541e+09   4.11658e+09
# PM_ST_CMPL       7.45961e+10   2.33113e+09   2.21164e+09   2.46327e+09
# PM_TLB_MISS      2.45894e+08   7.68418e+06   6.98733e+06   2.05724e+07
# PM_CYC           3.0575e+11    9.55467e+09   9.36585e+09   9.62227e+09
#
#                  [time]        [calls]       <%mpi>        <%wall>
# MPI_Send         188.386       639616        71.29         19.76
# MPI_Wait         69.5032       639616        26.30         7.29
# MPI_Irecv        6.34936       639616        2.40          0.67
# MPI_Barrier      0.0177442     32            0.01          0.00
# MPI_Reduce       0.00540609    32            0.00          0.00
# MPI_Comm_rank    0.00465156    32            0.00          0.00
# MPI_Comm_size    0.000145341   32            0.00          0.00
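Several of the percentages in this report can be rederived from its totals, which is a useful sanity check when reading IPM output. The sketch below copies the wallclock, mpi, MPI_Send, and gflop/sec totals from the report above. The variable names are illustrative, and the formulas are the obvious ratios rather than anything taken from IPM's source.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

# Totals copied from the report above (summed over all 32 tasks).
wallclock_total = 953.272   # seconds
mpi_total = 264.267         # seconds spent in MPI
mpi_send_time = 188.386     # seconds spent in MPI_Send
gflops_total = 2.33478      # aggregate gflop/sec
mpi_tasks = 32

comm_pct = percent(mpi_total, wallclock_total)           # about 27.72, the banner %comm
send_pct_mpi = percent(mpi_send_time, mpi_total)         # about 71.29, <%mpi> for MPI_Send
send_pct_wall = percent(mpi_send_time, wallclock_total)  # about 19.76, <%wall> for MPI_Send
gflops_per_task = gflops_total / mpi_tasks               # about 0.07296, the per-task average
```

Read this way, the report says that a bit more than a quarter of the run is communication and that most of that time is spent in MPI_Send, which points to the send pattern as the first place to look.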
