Computational Informatics for Brain Electromagnetic Research


Performance Technology for Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon
Overview
- Introduction
  - Definitions, general problem
- Tuning and Analysis Utilities (TAU)
  - Instrumentation
  - Measurement
  - Analysis
  - Work in progress:
    - Visualization: Vampir
    - Performance Monitoring and Steering
    - Performance Database Framework
- Case Study: Uintah
- Conclusions
General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
Computation Model for Performance Technology
- How to address dual performance technology goals?
  - Robust capabilities + widely available methodologies
  - Contend with problems of system diversity
  - Flexible tool composition/configuration/integration
- Approaches
  - Restrict computation types / performance problems
    -> limited performance technology coverage
  - Base technology on abstract computation model
    -> general architecture and software execution features
    -> map features/methods to existing complex system types
    -> develop capabilities that can adapt and be optimized
General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Diagram: physical view shows nodes with node memory on an interconnection network (* marks inter-node message communication); model view shows a node as an SMP containing contexts (VM spaces), each holding threads and memory]
Definitions – Profiling
- Profiling
  - Recording of summary information during execution
    - inclusive, exclusive time, # calls, hardware statistics, …
  - Reflects performance behavior of program entities
    - functions, loops, basic blocks
    - user-defined "semantic" entities
  - Very good for low-cost performance assessment
  - Helps to expose performance bottlenecks and hotspots
  - Implemented through
    - sampling: periodic OS interrupts or hardware counter traps
    - instrumentation: direct insertion of measurement code
Definitions – Tracing
- Tracing
  - Recording of information about significant points (events) during program execution
    - entering/exiting code region (function, loop, block, …)
    - thread/process interactions (e.g., send/receive message)
  - Save information in event record
    - timestamp
    - CPU identifier, thread identifier
    - event type and event-specific information
  - Event trace is a time-sequenced stream of event records
  - Can be used to reconstruct dynamic program behavior
  - Typically requires code instrumentation
Event Tracing: Instrumentation, Monitor, Trace

CPU A:
void master {
  trace(ENTER, 1);
  ...
  trace(SEND, B);
  send(B, tag, buf);
  ...
  trace(EXIT, 1);
}

CPU B:
void slave {
  trace(ENTER, 2);
  ...
  recv(A, tag, buf);
  trace(RECV, A);
  ...
  trace(EXIT, 2);
}

Event definitions: 1 = master, 2 = slave, 3 = ...
MONITOR trace (timestamp, CPU, event, data):
58 A ENTER 1
60 B ENTER 2
62 A SEND  B
64 A EXIT  1
68 B RECV  A
69 B EXIT  2
...
Event Tracing: "Timeline" Visualization

Event definitions: 1 = master, 2 = slave, 3 = ... (regions: main, master, slave, ...)
Trace records (timestamp, CPU, event, data):
58 A ENTER 1
60 B ENTER 2
62 A SEND  B
64 A EXIT  1
68 B RECV  A
69 B EXIT  2
...
[Timeline diagram: processes A and B across timestamps 58–70, with colored region bars and an arrow for the A-to-B message]
TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - multi-level: system / software / parallelism
  - measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - portable, configurable performance profiling/tracing facility
  - open software approach
- University of Oregon, LANL, FZJ Germany
- http://www.cs.uoregon.edu/research/paracomp/tau
Strategies for Empirical Performance Evaluation
- Empirical performance evaluation as a series of performance experiments
  - Experiment trials describing instrumentation and measurement requirements
  - Where/When/How axes of empirical performance space
    - where are performance measurements made in the program?
    - when is performance instrumentation done?
    - how are performance measurement/instrumentation chosen?
- Strategies for achieving flexibility and portability goals
  - Limited performance methods restrict evaluation scope
  - Non-portable methods force use of different techniques
  - Integration and combination of strategies
TAU Performance System Architecture
[Architecture diagram; trace output formats shown include Paraver and EPILOG]
TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT – source-to-source translation
  - MPI – wrapper interposition library
  - Opari – OpenMP directive rewriting
  - Binary:
    - DyninstAPI – runtime code patching
    - JVMPI – Java virtual machine instrumentation
TAU Instrumentation
- Targets common measurement interface (TAU API)
- Object-based design and implementation
  - Macro-based, using constructor/destructor techniques
  - Program units: functions, classes, templates, blocks
  - Uniquely identify functions and templates
    - name and type signature (name registration)
    - static object creates performance entry
    - dynamic object receives static object pointer
    - runtime type identification for template instantiations
- C and Fortran instrumentation variants
- Instrumentation and measurement optimization
Multi-Level Instrumentation
- Uses multiple instrumentation interfaces
- Shares information: cooperation between interfaces
- Taps information at multiple levels
- Provides selective instrumentation at each level
- Targets a common performance model
- Presents a unified view of execution
Manual Instrumentation – Using TAU
- Install TAU
  % configure ; make clean install
- Instrument application
  - TAU Profiling API
- Modify application makefile
  - include TAU's stub makefile, modify variables
- Execute application
  % mpirun -np <procs> a.out
- Analyze performance data
  - jracy, vampir, pprof, paraver …
TAU Manual Instrumentation API
- Initialization and runtime configuration
  - TAU_PROFILE_INIT(argc, argv);
  - TAU_PROFILE_SET_NODE(myNode);
  - TAU_PROFILE_SET_CONTEXT(myContext);
  - TAU_PROFILE_EXIT(message);
  - TAU_REGISTER_THREAD();
- Function and class methods
  - TAU_PROFILE(name, type, group);
- Template
  - TAU_TYPE_STRING(variable, type);
  - TAU_PROFILE(name, type, group);
  - CT(variable);
- User-defined timing
  - TAU_PROFILE_TIMER(timer, name, type, group);
  - TAU_PROFILE_START(timer);
  - TAU_PROFILE_STOP(timer); …
Manual Instrumentation – C++ Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}
int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo …
  return 0;
}
Manual Instrumentation – C Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  …
  TAU_PROFILE_STOP(tmain);
  return 0;
}
int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  return 0;
}
Manual Instrumentation – F90 Example
cc34567 Cubes program – comment line
      PROGRAM SUM_OF_CUBES
        integer profiler(2)
        save profiler
        INTEGER :: H, T, U
        call TAU_PROFILE_INIT()
        call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
        call TAU_PROFILE_START(profiler)
        call TAU_PROFILE_SET_NODE(0)
! This program prints all 3-digit numbers that
! equal the sum of the cubes of their digits.
        DO H = 1, 9
          DO T = 0, 9
            DO U = 0, 9
              IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
                PRINT "(3I1)", H, T, U
              ENDIF
            END DO
          END DO
        END DO
        call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES
Instrumenting Multithreaded Applications
#include <TAU.h>
#include <pthread.h>
void * threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls in this thread
  TAU_PROFILE("void * threaded_function", " ", TAU_DEFAULT);
  work();
  return NULL;
}
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0);
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, NULL, threaded_function, NULL);
  pthread_join(tid, NULL); /* wait for the thread before exiting */
  return 0;
}
Compiling: TAU Makefiles
- Include TAU Stub Makefile (<arch>/lib) in the user's Makefile.
- Variables:
  - TAU_CXX          C++ compiler used by TAU
  - TAU_CC, TAU_F90  C and F90 compilers
  - TAU_DEFS         Defines used by TAU. Add to CFLAGS
  - TAU_LDFLAGS      Linker options. Add to LDFLAGS
  - TAU_INCLUDE      Header files include path. Add to CFLAGS
  - TAU_LIBS         Statically linked TAU library. Add to LIBS
  - TAU_SHLIBS       Dynamically linked TAU library
  - TAU_MPI_LIBS     TAU's MPI wrapper library for C/C++
  - TAU_MPI_FLIBS    TAU's MPI wrapper library for F90
  - TAU_FORTRANLIBS  Must be linked in with C++ linker for F90
  - TAU_DISABLE      TAU's dummy F90 stub library
- Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for F90).
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CXX) $(CFLAGS) -c $< -o $@
TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT – source-to-source translation
  - MPI – wrapper interposition library
  - Opari – OpenMP directive rewriting
Program Database Toolkit (PDT)
- Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - commercial-grade front-end parsers
  - portable IL analyzer, database format, and access API
  - open software approach for tool development
- Target and integrate multiple source languages
- Used in TAU to build automated performance instrumentation tools
Program Database Toolkit
[Diagram: Application / Library sources flow through the C/C++ parser and Fortran 77/90 parser into IL, then through the C/C++ and Fortran 77/90 IL analyzers into Program Database Files. DUCTAPE provides access for the tools built on top: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90 interoperability), and TAU_instr (automatic source instrumentation)]
PDT Components
- Language front end
  - Edison Design Group (EDG): C, C++
  - Mutek Solutions Ltd.: F77, F90
  - creates an intermediate-language (IL) tree
- IL Analyzer
  - processes the intermediate-language (IL) tree
  - creates "program database" (PDB) formatted file
- DUCTAPE (Bernd Mohr, ZAM, Germany)
  - C++ program Database Utilities and Conversion Tools APplication Environment
  - processes and merges PDB files
  - C++ library to access the PDB for PDT applications
TAU Makefile for PDT – C++ Example
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CXX) $(CFLAGS) -c $*.inst.cpp -o $@
Instrumentation Control
- Selection of which performance events to observe
  - Could depend on scope, type, level of interest
  - Could depend on instrumentation overhead
- How is selection supported in the instrumentation system?
  - No choice
  - Include / exclude lists (TAU)
  - Environment variables
  - Static vs. dynamic
- Problem: Controlling instrumentation of small routines
  - High relative measurement overhead
  - Significant intrusion and possible perturbation
Using PDT: tau_instrumentor
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file>]
For selective instrumentation, use the -f option:
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
Rule-Based Overhead Analysis (N. Trebon, UO)
- Analyze the performance data to determine events with high (relative) overhead performance measurements
- Create a select list for excluding those events
- Rule grammar (used in TAUreduce tool):
  [GroupName:] Field Operator Number
  - GroupName indicates the rule applies to events in that group
  - Field is an event metric attribute (from profile statistics)
    - numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  - Operator is one of >, <, or =
  - Number is any number
- Compound rules possible using & between simple rules
Example Rules
# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that use less than 1000
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

# Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT – source-to-source translation
  - MPI – wrapper interposition library
  - Opari – OpenMP directive rewriting [FZJ, Germany]
TAU's MPI Wrapper Interposition Library
- Uses standard MPI Profiling Interface
  - Provides name-shifted interface
    - MPI_Send → PMPI_Send
    - weak bindings
- Interpose TAU's MPI wrapper library between the application and MPI
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi
MPI Library Instrumentation (MPI_Send)
int
MPI_Send(…) /* TAU redefines MPI_Send */
...
{
int returnVal, typesize;
TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
TAU_PROFILE_START(tautimer);
if (dest != MPI_PROC_NULL) {
PMPI_Type_size(datatype, &typesize);
TAU_TRACE_SENDMSG(tag, dest, typesize*count);
}
/* Wrapper calls PMPI_Send */
returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
TAU_PROFILE_STOP(tautimer);
return returnVal;
}
Including TAU's stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
LDFLAGS = $(USER_OPT) $(TAU_LDFLAGS)
OBJS = ...
TARGET = a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CXX) $(CFLAGS) -c $< -o $@
TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT – source-to-source translation
  - MPI – wrapper interposition library
  - Opari – OpenMP directive rewriting [FZJ, Germany]
Instrumentation of OpenMP Constructs
- OPARI: OpenMP Pragma And Region Instrumentor
- Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
- Done: Supports
  - Fortran77 and Fortran90, OpenMP 2.0
  - C and C++, OpenMP 1.0
  - POMP extensions
  - EPILOG and TAU POMP implementations
  - Preserves source code information (#line line file)
- Work in Progress: Investigating standardization through the OpenMP Forum
OpenMP API Instrumentation
- Transform
  - omp_#_lock() → pomp_#_lock()
  - omp_#_nest_lock() → pomp_#_nest_lock()
  [ # = init | destroy | set | unset | test ]
- POMP version
  - Calls omp version internally
  - Can do extra stuff before and after call
Example: !$OMP PARALLEL DO Instrumentation

call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
call pomp_parallel_begin(d)
call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
  do loop
!$OMP END DO NOWAIT
call pomp_barrier_enter(d)
!$OMP BARRIER
call pomp_barrier_exit(d)
call pomp_do_exit(d)
call pomp_parallel_end(d)
!$OMP END PARALLEL
call pomp_parallel_join(d)
Opari Instrumentation: Example
- OpenMP directive instrumentation

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate(a1,a2,a3,a4,a5) nowait
for (i = i1; i <= i2; i++) {
  for (j = j1; j <= j2; j++) {
    new_psi[i][j] = a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff = diff + fabs(new_psi[i][j] - psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
OPARI: Basic Usage (f90)
- Reset OPARI state information
  % rm -f opari.rc
- Call OPARI for each input source file
  % opari file1.f90
  ...
  % opari fileN.f90
- Generate OPARI runtime table, compile it with ANSI C
  % opari -table opari.tab.c
  % cc -c opari.tab.c
- Compile modified files *.mod.f90 using OpenMP
- Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL
OPARI: Makefile Template (C/C++)
OMPCC = ...   # insert C OpenMP compiler here
OMPCXX = ...  # insert C++ OpenMP compiler here
.c.o:
	opari $<
	$(OMPCC) $(CFLAGS) -c $*.mod.c
.cc.o:
	opari $<
	$(OMPCXX) $(CXXFLAGS) -c $*.mod.cc
opari.init:
	rm -f opari.rc
opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.c myheader.h
myfile2.o: ...
OPARI: Makefile Template (Fortran)
OMPF77 = ...  # insert f77 OpenMP compiler here
OMPF90 = ...  # insert f90 OpenMP compiler here
.f.o:
	opari $<
	$(OMPF77) $(FFLAGS) -c $*.mod.F
.f90.o:
	opari $<
	$(OMPF90) $(F90FLAGS) -c $*.mod.F90
opari.init:
	rm -f opari.rc
opari.tab.o:
	opari -table opari.tab.c
	$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
	$(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.f90
myfile2.o: ...
TAU Measurement
- Performance information
  - High-resolution timer library (real-time / virtual clocks)
  - General software counter library (user-defined events)
  - Hardware performance counters
    - PAPI (Performance API) (UTK, Ptools Consortium)
    - consistent, portable API
- Organization
  - Node, context, thread levels
  - Profile groups for collective events (runtime selective)
  - Performance data mapping between software levels
TAU Measurement (continued)
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events
  - TAU parallel profile database
  - Callpath profiles
  - Hardware counter values
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Timestamp synchronization
- User-configurable measurement library (user controlled)
TAU Measurement System Configuration
- configure [OPTIONS]
  - {-c++=<CC>, -cc=<cc>}       Specify C++ and C compilers
  - {-pthread, -sproc}          Use pthread or SGI sproc threads
  - -openmp                     Use OpenMP threads
  - -opari=<dir>                Specify location of Opari OpenMP tool
  - -papi=<dir>                 Specify location of PAPI
  - -pdt=<dir>                  Specify location of PDT
  - {-mpiinc=<d>, -mpilib=<d>}  Specify MPI library instrumentation
  - -TRACE                      Generate TAU event traces
  - -PROFILE                    Generate TAU profiles
  - -PROFILECALLPATH            Generate callpath profiles (1-level)
  - -MULTIPLECOUNTERS           Use more than one hardware counter
  - -CPUTIME                    Use user time + system time
  - -PAPIWALLCLOCK              Use PAPI to access wallclock time
  - -PAPIVIRTUAL                Use PAPI for virtual (user) time …
TAU Measurement Configuration – Examples
- ./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
  - Use TAU with IBM's xlC compiler, PDT and the pthread library
  - Enable TAU profiling (default)
- ./configure -TRACE -PROFILE
  - Enable both TAU profiling and tracing
- ./configure -c++=CC -cc=cc -MULTIPLECOUNTERS -papi=/usr/local/packages/papi -opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib -SGITIMERS -PAPIVIRTUAL
  - Use OpenMP+MPI with SGI's compiler suite and Opari, and use PAPI for accessing hardware performance counters and virtual time for measurements
- Typically configure multiple measurement libraries
Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01    (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib

For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiple counters):
% setenv COUNTER1 PAPI_FP_INS      (PAPI's floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM      (PAPI's L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME   (PAPI's virtual time)
% setenv COUNTER4 SGI_TIMERS       (wallclock time)

% mpirun -np <n> <application>
% llsubmit job.sh
Performance Mapping
- Associate performance with "significant" entities (events)
  - Source code points are important
    - Functions, regions, control flow events, user events
  - Execution process and thread entities are important
  - Some entities are more abstract, harder to measure
- Consider callgraph (callpath) profiling
  - Measure time (metric) along an edge (path) of callgraph
    - Incident edge gives parent / child view
    - Edge sequence (path) gives parent / descendant view
- Problem: Callpath profiling when callgraph is unknown
  - Determine callgraph dynamically at runtime
  - Map performance measurement to dynamic call path state
1-Level Callpath Implementation in TAU
- TAU maintains a performance event (routine) callstack
- Profiled routine (child) looks in callstack for parent
  - Previous profiled performance event is the parent
  - A callpath profile structure is created the first time the parent calls
- TAU records parent in a callgraph map for child
  - String representing 1-level callpath used as its key
    - "a( ) => b( )" : name for time spent in "b" when called by "a"
  - Map returns pointer to callpath profile structure
  - 1-level callpath is profiled using this profiling data
- Builds upon TAU's performance mapping technology
  - Measurement is independent of instrumentation
  - Use -PROFILECALLPATH to configure TAU
TAU Analysis
- Profile analysis
  - pprof: parallel profiler with text-based display
  - racy: graphical interface to pprof (Tcl/Tk)
  - jracy: Java implementation of Racy
- Trace analysis and visualization
  - Trace merging and clock adjustment (if necessary)
  - Trace format conversion (ALOG, SDDF, Vampir)
  - Vampir (Pallas) trace visualization
  - Paraver (CEPBA) trace visualization
Pprof Command
- pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
  - -c       Sort according to number of calls
  - -b       Sort according to number of subroutines called
  - -m       Sort according to msecs (exclusive time total)
  - -t       Sort according to total msecs (inclusive time total)
  - -e       Sort according to exclusive time per call
  - -i       Sort according to inclusive time per call
  - -v       Sort according to standard deviation (exclusive usec)
  - -r       Reverse sorting order
  - -s       Print only summary profile information
  - -n num   Print only first num functions
  - -f file  Specify full path and filename without node ids
  - -l       List all functions and exit
TAU Parallel Performance Profiles
Terminology – Example
- For routine "int main( )":

int main( )
{ /* takes 100 secs */
  f1(); /* takes 20 secs */
  f2(); /* takes 50 secs */
  f1(); /* takes 20 secs */
  /* other work */
}

- Inclusive time: 100 secs
- Exclusive time: 100-20-50-20 = 10 secs
- Calls: 1 call
- Subrs (no. of child routines called): 3
- Inclusive time/call: 100 secs
- Time can be replaced by counts
jracy (NAS Parallel Benchmark – LU)
[Screenshot: global profiles (n: node, c: context, t: thread); individual profile; routine profile across all nodes]

jracy (Callpath Profiles) (R. A. Bell, UO)
[Screenshot: callpath profile across all nodes]
Vampir Trace Visualization Tool
- Visualization and analysis of MPI programs
- Originally developed by Forschungszentrum Jülich
- Current development by Technical University Dresden
- Distributed by PALLAS, Germany
- http://www.pallas.de/pages/vampir.htm
Using TAU with Vampir
- Configure TAU with the -TRACE option
  % configure -TRACE -SGITIMERS …
- Execute application
  % mpirun -np 4 a.out
  - This generates TAU traces and event descriptors
- Merge all traces using tau_merge
  % tau_merge *.trc app.trc
- Convert traces to Vampir trace format using tau_convert
  % tau_convert -pv app.trc tau.edf app.pv
  - Note: Use -vampir instead of -pv for multi-threaded traces
- Load generated trace file in Vampir
  % vampir app.pv
Vampir: Main Window
- Trace file loading can be
  - Interrupted at any time
  - Resumed
  - Started at a specified time offset
- Provides main menu
  - Access to global and process-local displays
  - Preferences
  - Help
- Trace file can be re-written (re-grouped symbols)
Vampir: Timeline Diagram
- Functions organized into groups
- Coloring by group
- Message lines can be colored by tag or size
- Information about states, messages, collective, and I/O operations available by clicking on the representation
Vampir: Timeline Diagram (Message Info)
- Source-code references are displayed if recorded in the trace
Vampir: Execution Statistics Displays
- Aggregated profiling information: execution time, # calls, inclusive/exclusive
- Available for all/any group (activity)
- Available for all routines (symbols)
- Available for any trace part (select in timeline diagram)
Vampir: Communication Statistics Displays
- Bytes sent/received for collective operations
- Message length statistics
- Available for any trace part
- Byte and message count, min/max/avg message length, and min/max/avg bandwidth for each process pair
Vampir: Other Features
- Parallelism display
- Powerful filtering and trace comparison features
- All diagrams highly customizable (through context menus)
- Dynamic global call graph tree
Vampir: Process Displays
- Activity chart
- Call tree
- Timeline
- For all selected processes in the global displays
Vampir (NAS Parallel Benchmark – LU)
[Screenshots: timeline display, callgraph display, parallelism display, communications display]
TAU Performance System Status
- Computing platforms
  - IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), Hitachi, NEC
- Programming languages
  - C, C++, Fortran 77/90, HPF, Java
- Communication libraries
  - MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
- Thread libraries
  - pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
- Compilers
  - KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
PDT Status
- Program Database Toolkit (Version 2.1, web download)
  - EDG C++ front end (Version 2.45.2)
  - Mutek Fortran 90 front end (Version 2.4.1)
  - C++ and Fortran 90 IL Analyzer
  - DUCTAPE library
  - Standard C++ system header files (KCC Version 4.0f)
- PDT-constructed tools
  - TAU instrumentor (C/C++/F90)
  - Program analysis support for SILOON and CHASM
- Platforms
  - SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64), Apple, Windows, Cray T3E, Hitachi
Work in Progress
- Visualization:
  - TAU will generate event traces with PAPI performance data; Vampir (v3.0) will support visualization of this data
- Performance Monitoring and Steering
- Performance Database Framework

Vampir v3.x: HPM Counter
- Counter timeline display
- Process timeline display
Performance Monitoring and Steering
- Desirable to monitor performance during execution
  - Long-running applications
  - Steering computations for improved performance
- Large-scale parallel applications complicate solutions
  - More parallel threads of execution producing data
  - Large amount of performance data (relative) to access
  - Analysis and visualization more difficult
- Problem: Online performance data access and analysis
  - Incremental profile sampling (based on files)
  - Integration in computational steering system
  - Dynamic performance measurement and access
Online Performance Analysis (K. Li, UO)
[Diagram: an application instrumented with the TAU performance system writes parallel performance data output to the file system as accumulated samples; a Performance Data Integrator (sample sequencing, reader synchronization) and Performance Data Reader stream the data to a Performance Analyzer and Performance Visualizer inside SCIRun (Univ. of Utah), which closes the loop through performance steering]

2D Field Performance Visualization in SCIRun
[Screenshot: SCIRun program]
Uintah Computational Framework (UCF)
- University of Utah
- UCF analysis
  - Scheduling
  - MPI library
  - Components
  - 500 processes
- Use for online and offline visualization
- Apply SCIRun steering
Empirical-Based Performance Optimization Process
[Diagram: a cycle linking Performance Observation (observability requirements) → Performance Experimentation (characterization, via experiment trials and experiment schemas) → Performance Diagnosis (properties) → Performance Tuning (hypotheses), and back]
TAU Performance Database Framework
[Diagram: raw performance data plus a performance data description feed PerfDML translators into PerfDB (an ORDB, PostgreSQL); a performance analysis and query toolkit sits on top, serving performance analysis programs]
- profile data only
- XML representation
- project / experiment / trial
PerfDBF Architecture (L. Li, R. Bell, UO)
[Diagram: application profiled with TAU → standard TAU output data format → TAU-to-XML converter → TAU XML → database loader → SQL database → analysis tool]
Scalability Analysis Process
- Scalability study on LU
  % suite.def  # of procs -> 1, 2, 4, and 8
  % mpirun -np 1 lu.W1
  % mpirun -np 2 lu.W2
  % mpirun -np 4 lu.W4
  % mpirun -np 8 lu.W8
- populateDatabase.sh
  - run Java translator to translate profiles into XML
  - run Java XML reader to write XML profiles to database
- Read times for routines and program from experiments
- Calculate scalability metrics
Contents of Performance Database
Scalability Analysis Results
- Scalability of LU performance experiments
  - Four trial runs

Funname | processors | meanspeedup
applu   | 2          | 2.0896117809566
applu   | 4          | 4.812100975788783
applu   | 8          | 8.168409581149514
exact   | 2          | 1.95853126762839071803
exact   | 4          | 4.03622321124616535446
exact   | 8          | 7.193812137750623668346
Current Status and Future
- PerfDBF prototype
  - TAU profile to XML translator
  - XML to PerfDB populator
  - PostgreSQL database
  - Java-based PostgreSQL query module
    - Use as a layer to support performance analysis tools
    - Make accessing the Performance Database quicker
  - Continue development
- XML parallel profile representation
  - Basic specification
Overview
- Introduction
  - Definitions, general problem
- Tuning and Analysis Utilities (TAU)
  - Instrumentation
  - Measurement
  - Analysis
  - Work in progress:
    - Visualization: Vampir
    - Performance Monitoring and Steering
    - Performance Database Framework
- Case Study: Uintah
- Conclusions
Case Study: Utah ASCI/ASAP Level 1 Center
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  - Fundamental chemistry and engineering physics models
  - Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  - Very large-scale simulations
- Computer science problems:
  - Coupling of multiple simulation codes
  - Software engineering across diverse expert teams
  - Achieving high performance on large-scale systems
Example C-SAFE Simulation Problems
[Images: heptane fire simulation; material stress simulation]
- Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
Uintah High-Level Component View
Uintah Computational Framework
 Execution model based on software (macro) dataflow
   Exposes parallelism and hides data transport latency
   Computations expressed as directed acyclic graphs of tasks
     Each task consumes input and produces output (input to a future task)
     Inputs/outputs specified for each patch in a structured grid
 Abstraction of global single-assignment memory
   DataWarehouse
     Directory mapping names to values (array structured)
     Write a value once, then communicate it to awaiting tasks
 Task graph gets mapped to processing resources
 Communication schedule approximates the global optimum
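The write-once DataWarehouse idea above can be sketched as a directory that maps names to values and rejects a second assignment. The class and method names below are hypothetical and much simpler than the real Uintah DataWarehouse; this is only a minimal illustration of single-assignment semantics.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal sketch of a single-assignment data warehouse (illustrative
// interface; not the actual UCF API).
class DataWarehouse {
public:
    // Each named value may be written exactly once.
    void put(const std::string& name, const std::vector<double>& value) {
        if (store.count(name))
            throw std::runtime_error("single-assignment violation: " + name);
        store[name] = value;
    }
    // Awaiting tasks read the value; throws if it was never written.
    const std::vector<double>& get(const std::string& name) const {
        return store.at(name);
    }
private:
    std::map<std::string, std::vector<double>> store;
};
```

Because every name is assigned once, the scheduler can treat a `put` as satisfying all pending `get` dependencies, which is what lets the task graph hide data transport latency.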
Uintah Task Graph (Material Point Method)
 Diagram of named tasks (ovals) and data (edges)
 Imminent computation
   Dataflow-constrained
 MPM
   Newtonian material point motion time step
   Solid: values defined at material point (particle)
   Dashed: values defined at vertex (grid)
   Prime (‘): values updated during time step
Uintah PSE
 UCF automatically sets up:
   Domain decomposition
   Inter-processor communication with aggregation/reduction
   Parallel I/O
   Checkpoint and restart
   Performance measurement and analysis (stay tuned)
 Software engineering
   Coding standards
   CVS (commits: Y3 – 26.6 files/day, Y4 – 29.9 files/day)
   Correctness regression testing with Bugzilla bug tracking
   Nightly build (parallel compiles)
   170,000 lines of code (Fortran and C++ tasks supported)
Performance Technology Integration
 Uintah presents challenges to performance integration
   Software diversity and structure
     UCF middleware, simulation code modules
     Component-based hierarchy
   Portability objectives
     Cross-language and cross-platform
     Multi-parallelism: thread, message passing, mixed
   Scalability objectives
   High-level programming and execution abstractions
 Requires flexible and robust performance technology
 Requires support for performance mapping
Task Execution in Uintah Parallel Scheduler
 Profile methods and functions in the scheduler and in the MPI library
 Task execution time dominates (what task?)
 MPI communication overheads (where?)
 Need to map performance data!
 (Figure: task execution time distribution)
Semantics-Based Performance Mapping
 Associate performance measurements with high-level semantic abstractions
 Need mapping support in the performance measurement system to assign data correctly
Semantic Entities/Attributes/Associations (SEAA)
 New dynamic mapping scheme
   Entities defined at any level of abstraction
   Attribute entity with semantic information
   Entity-to-entity associations
 Two association types (implemented in TAU API)
   Embedded – extends the data structure of the associated object to store the performance measurement entity
   External – creates an external look-up table using the address of the object as the key to locate the performance measurement entity
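The external association type above can be sketched as a look-up table keyed by the address of the user's object, so the object's own layout is left untouched. `TimerHandle` and `ExternalAssociation` are illustrative names for this sketch, not TAU's actual API.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative performance measurement entity (not TAU's real type).
struct TimerHandle { std::string name; long calls = 0; };

// Sketch of an external association: the key is the object's address,
// so no field has to be embedded in the associated object itself.
class ExternalAssociation {
public:
    TimerHandle& timerFor(const void* object, const std::string& name) {
        auto key = reinterpret_cast<std::uintptr_t>(object);
        auto it = table.find(key);
        if (it == table.end())                 // first encounter: create entry
            it = table.emplace(key, TimerHandle{name}).first;
        return it->second;
    }
private:
    std::unordered_map<std::uintptr_t, TimerHandle> table;
};
```

The trade-off is the one the slide implies: embedded associations cost a pointer in every object but resolve in constant time with no hashing, while external associations leave objects untouched at the price of a table look-up per measurement.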
Uintah Task Performance Mapping
 Uintah partitions individual particles across processing elements (processes or threads)
 Simulation tasks in the task graph work on particles
   Tasks have domain-specific character in the computation
     “interpolate particles to grid” in the Material Point Method
   Task instances generated for each partitioned particle set
   Execution scheduled with respect to task dependencies
 How to attribute execution time among different tasks?
   Assign a semantic name (task type) to each task instance
     SerialMPM::interpolateParticleToGrid
   Map a TAU timer object to the (abstract) task (semantic entity)
   Look up the timer object using the task type (semantic attribute)
   Further partition along different domain-specific axes
Using External Associations
 Two-level mappings:
   Level 1: <task name, timer>
   Level 2: <task name, patch, timer>
 (Figure: embedded association, which stores the performance data with the data object, vs. external association, which locates it through a hash table)
Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
Task Performance Mapping (Profile)
 Mapped task performance across processes
 Performance mapping for different tasks
Task Performance Mapping (Trace)
 Work packet computation events colored by task type
 Distinct phases of computation can be identified based on task
Task Performance Mapping (Trace - Zoom)
 Startup communication imbalance
Task Performance Mapping (Trace - Parallelism)
 Communication / load imbalance
Comparing Uintah Traces for Scalability Analysis
 (Figure: side-by-side traces for 8 processes and 32 processes)
Scaling Performance Optimizations
 Last year: initial “correct” scheduler
 Reduced communication by 10x
 Reduced task graph overhead by 20x
 (ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory)
Scalability to 2000 Processors (Fall 2001)
 (ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory)
Concluding Remarks
 Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
 To build more sophisticated performance tools, existing proven performance technology must be utilized
 Performance tools must be integrated with software and systems models and technology
   Performance-engineered software
   Tools must function consistently and coherently in software and system environments
 The PAPI and TAU performance systems offer robust performance technology that can be broadly integrated
Information
 TAU (http://www.acl.lanl.gov/tau)
 PDT (http://www.acl.lanl.gov/pdtoolkit)
 PAPI (http://icl.cs.utk.edu/projects/papi/)
 OPARI (http://www.fz-juelich.de/zam/kojak/)
Support Acknowledgement
 TAU and PDT support:
   Department of Energy (DOE)
     DOE 2000 ACTS contract
     DOE MICS contract
     DOE ASCI Level 3 (LANL, LLNL)
     U. of Utah DOE ASCI Level 1 subcontract
   DARPA
   NSF National Young Investigator (NYI) award