Computational Informatics for Brain Electromagnetic Research
Performance Technology for
Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon
Overview
Introduction
Tuning and Analysis Utilities (TAU)
Instrumentation
Measurement
Analysis
Work in progress:
Definitions, general problem
Visualization: Vampir
Performance Monitoring and Steering
Performance Database Framework
Case Study: Uintah
Conclusions
General Problems
How do we create robust and ubiquitous
performance technology for the analysis and tuning
of parallel and distributed software and systems in
the presence of (evolving) complexity challenges?
How do we apply performance technology effectively
for the variety and diversity of performance
problems that arise in the context of complex
parallel and distributed computer systems?
Computation Model for Performance Technology
How to address dual performance technology goals?
Robust capabilities + widely available methodologies
Contend with problems of system diversity
Flexible tool composition/configuration/integration
Approaches
Restrict computation types / performance problems
→ limited performance technology coverage
Base technology on abstract computation model
→ general architecture and software execution features
map features/methods to existing complex system types
develop capabilities that can adapt and be optimized
General Complex System Computation Model
Node: physically distinct shared memory machine
Message passing node interconnection network
Context: distinct virtual memory space within node
Thread: execution threads (user/system) in context
[Diagram: physical view – nodes with node memory connected by an interconnection network; model view – an SMP node containing contexts (each a distinct VM space) and threads, with inter-node message communication.]
Definitions – Profiling
Profiling
Recording of summary information during execution
inclusive, exclusive time, # calls, hardware statistics, …
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined “semantic” entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and hotspots
Implemented through
sampling: periodic OS interrupts or hardware counter traps
instrumentation: direct insertion of measurement code
Definitions – Tracing
Tracing
Recording of information about significant points (events)
during program execution
entering/exiting code region (function, loop, block, …)
thread/process interactions (e.g., send/receive message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event records
Can be used to reconstruct dynamic program behavior
Typically requires code instrumentation
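The event-record layout described above can be sketched in C++ (an illustrative toy, not TAU's actual trace format; all type and function names here are invented for the example):

```cpp
#include <cstdint>
#include <vector>

// Toy event record mirroring the fields listed above (timestamp, CPU
// identifier, thread identifier, event type, event-specific data).
enum class EventType : std::uint8_t { Enter, Exit, Send, Recv };

struct EventRecord {
    std::uint64_t timestamp; // when the event occurred
    int cpu;                 // CPU / process identifier
    int thread;              // thread identifier
    EventType type;          // kind of event
    int data;                // event-specific info (region id, peer, ...)
};

// An event trace is a time-sequenced stream of such records.
inline std::vector<EventRecord>& trace_buffer() {
    static std::vector<EventRecord> buf;
    return buf;
}

inline void trace(std::uint64_t ts, int cpu, int thread,
                  EventType type, int data) {
    trace_buffer().push_back(EventRecord{ts, cpu, thread, type, data});
}
```

Replaying such a buffer in timestamp order is what allows the dynamic program behavior to be reconstructed after the run.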
Event Tracing: Instrumentation, Monitor, Trace
Event definition: 1 = master, 2 = slave, 3 = …
CPU A:
void master() {
trace(ENTER, 1);
...
trace(SEND, B);
send(B, tag, buf);
...
trace(EXIT, 1);
}
CPU B:
void slave() {
trace(ENTER, 2);
...
recv(A, tag, buf);
trace(RECV, A);
...
trace(EXIT, 2);
}
The monitor merges event records into a timestamp-ordered trace:
58 A ENTER 1
60 B ENTER 2
62 A SEND B
64 A EXIT 1
68 B RECV A
69 B EXIT 2
...
Event Tracing: “Timeline” Visualization
[Timeline visualization: the trace records (58 A ENTER 1 … 69 B EXIT 2) and the event definitions (1 = master, 2 = slave, 3 = …) are replayed to show, for each CPU (A, B), which routine (main, master, slave, …) is active along a time axis from 58 to 70.]
TAU Performance System Framework
Tuning and Analysis Utilities
Performance system framework for scalable parallel and distributed high-performance computing
Targets a general complex system computation model
nodes / contexts / threads
Multi-level: system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance instrumentation, measurement, analysis, and
visualization
Portable, configurable performance profiling/tracing facility
Open software approach
University of Oregon, LANL, FZJ Germany
http://www.cs.uoregon.edu/research/paracomp/tau
Strategies for Empirical Performance Evaluation
Empirical performance evaluation as a series of
performance experiments
Experiment trials describing instrumentation and
measurement requirements
Where/When/How axes of empirical performance space
where are performance measurements made in the program
when is performance instrumentation done
how are performance measurement/instrumentation techniques chosen
Strategies for achieving flexibility and portability goals
Limited performance methods restrict evaluation scope
Non-portable methods force use of different techniques
Integration and combination of strategies
TAU Performance System Architecture
[Architecture diagram; trace export targets include Paraver and EPILOG.]
TAU Instrumentation Options
Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT – Source-to-source translation
MPI - Wrapper interposition library
Opari – OpenMP directive rewriting
Binary: DyninstAPI – runtime code patching
JVMPI – Java virtual machine instrumentation
TAU Instrumentation
Targets common measurement interface (TAU API)
Object-based design and implementation
Macro-based, using constructor/destructor techniques
Program units: functions, classes, templates, blocks
Uniquely identify functions and templates
name and type signature (name registration)
static object creates performance entry
dynamic object receives static object pointer
runtime type identification for template instantiations
C and Fortran instrumentation variants
Instrumentation and measurement optimization
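The constructor/destructor technique above works roughly as in this minimal C++ sketch (illustration only, not TAU's implementation; `ProfileEntry`, `ScopedTimer`, and `square` are invented names):

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Minimal sketch of constructor/destructor-based instrumentation
// (illustration only; not TAU's implementation).
struct ProfileEntry {
    long calls = 0;      // number of invocations
    double seconds = 0;  // total inclusive time
};

inline std::map<std::string, ProfileEntry>& profile_table() {
    static std::map<std::string, ProfileEntry> table;  // name -> entry
    return table;
}

// Constructor marks region entry; destructor records elapsed time,
// so a single stack object instruments the whole scope.
class ScopedTimer {
    std::string name_;
    std::chrono::steady_clock::time_point start_;
public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        ProfileEntry& e = profile_table()[name_];
        e.calls += 1;
        e.seconds += std::chrono::duration<double>(elapsed).count();
    }
};

inline int square(int x) {
    ScopedTimer t("int square(int)");  // instruments this function
    return x * x;
}
```

A macro such as TAU_PROFILE can expand to exactly this pattern: a static object holding the performance entry plus a scoped object whose lifetime brackets the routine.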
Multi-Level Instrumentation
Uses multiple instrumentation interfaces
Shares information: cooperation between interfaces
Taps information at multiple levels
Provides selective instrumentation at each level
Targets a common performance model
Presents a unified view of execution
Manual Instrumentation – Using TAU
Install TAU
% configure ; make clean install
Instrument application
Modify application makefile
TAU Profiling API
include TAU’s stub makefile, modify variables
Execute application
% mpirun –np <procs> a.out;
Analyze performance data
jracy, vampir, pprof, paraver …
TAU Manual Instrumentation API
Initialization and runtime configuration
TAU_PROFILE_INIT(argc, argv);
TAU_PROFILE_SET_NODE(myNode);
TAU_PROFILE_SET_CONTEXT(myContext);
TAU_PROFILE_EXIT(message);
TAU_REGISTER_THREAD();
Function and class methods
TAU_PROFILE(name, type, group);
Template
TAU_TYPE_STRING(variable, type);
TAU_PROFILE(name, type, group);
CT(variable);
User-defined timing
TAU_PROFILE_TIMER(timer, name, type, group);
TAU_PROFILE_START(timer);
TAU_PROFILE_STOP(timer); …
Manual Instrumentation – C++ Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  foo();
  return 0;
}
int foo(void)
{
  TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
  TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  // other statements in foo …
  return 0;
}
Manual Instrumentation – C Example
#include <TAU.h>
int main(int argc, char **argv)
{
  TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* for sequential programs */
  TAU_PROFILE_START(tmain);
  foo();
  …
  TAU_PROFILE_STOP(tmain);
  return 0;
}
int foo(void)
{
  TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
  TAU_PROFILE_START(t);
  for (int i = 0; i < N; i++) {
    work(i);
  }
  TAU_PROFILE_STOP(t);
  return 0;
}
Manual Instrumentation – F90 Example
cc34567 Cubes program – comment line
PROGRAM SUM_OF_CUBES
integer profiler(2)
save profiler
INTEGER :: H, T, U
call TAU_PROFILE_INIT()
call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
call TAU_PROFILE_START(profiler)
call TAU_PROFILE_SET_NODE(0)
! This program prints all 3-digit numbers that
! equal the sum of the cubes of their digits.
DO H = 1, 9
DO T = 0, 9
DO U = 0, 9
IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
PRINT "(3I1)", H, T, U
ENDIF
END DO
END DO
END DO
call TAU_PROFILE_STOP(profiler)
END PROGRAM SUM_OF_CUBES
Instrumenting Multithreaded Applications
#include <TAU.h>
#include <pthread.h>
void * threaded_function(void *data)
{
  TAU_REGISTER_THREAD(); // Before any other TAU calls
  TAU_PROFILE("void * threaded_function", " ", TAU_DEFAULT);
  work();
  return NULL;
}
int main(int argc, char **argv)
{
  TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(0); /* single-node run */
  pthread_attr_t attr;
  pthread_t tid;
  pthread_attr_init(&attr);
  pthread_create(&tid, NULL, threaded_function, NULL);
  pthread_join(tid, NULL); /* wait for the thread before exiting */
  return 0;
}
Compiling: TAU Makefiles
Include TAU Stub Makefile (<arch>/lib) in the user’s Makefile.
Variables:
TAU_CXX
Specify the C++ compiler used by TAU
TAU_CC, TAU_F90
Specify the C, F90 compilers
TAU_DEFS
Defines used by TAU. Add to CFLAGS
TAU_LDFLAGS
Linker options. Add to LDFLAGS
TAU_INCLUDE
Header files include path. Add to CFLAGS
TAU_LIBS
Statically linked TAU library. Add to LIBS
TAU_SHLIBS
Dynamically linked TAU library
TAU_MPI_LIBS
TAU’s MPI wrapper library for C/C++
TAU_MPI_FLIBS
TAU’s MPI wrapper library for F90
TAU_FORTRANLIBS
Must be linked in with C++ linker for F90
TAU_DISABLE
TAU’s dummy F90 stub library
Note: Not including TAU_DEFS in CFLAGS disables
instrumentation in C/C++ programs (TAU_DISABLE for f90).
Including TAU’s stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET= a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(CXX) $(CFLAGS) -c $< -o $@
TAU Instrumentation Options
Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT – Source-to-source translation
MPI - Wrapper interposition library
Opari – OpenMP directive rewriting
Program Database Toolkit (PDT)
Program code analysis framework for developing source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing, database
creation, and database query
commercial grade front end parsers
portable IL analyzer, database format, and access API
open software approach for tool development
Target and integrate multiple source languages
Use in TAU to build automated performance
instrumentation tools
Program Database Toolkit
[Diagram: application / library source is processed by C/C++ and Fortran 77/90 parsers into IL, then by C/C++ and Fortran 77/90 IL analyzers into program database (PDB) files; through the DUCTAPE library these serve PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation).]
PDT Components
Language front end
Edison Design Group (EDG): C, C++
Mutek Solutions Ltd.: F77, F90
creates an intermediate-language (IL) tree
IL Analyzer
processes the intermediate-language (IL) tree
creates “program database” (PDB) formatted file
DUCTAPE (Bernd Mohr, ZAM, Germany)
C++ program Database Utilities and Conversion Tools APplication Environment
processes and merges PDB files
C++ library to access the PDB for PDT applications
TAU Makefile for PDT – C++ Example
include /usr/tau/include/Makefile
CXX = $(TAU_CXX)
CC = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_LIBS)
OBJS = ...
TARGET= a.out
$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CXX) $(CFLAGS) -c $*.inst.cpp -o $@
Instrumentation Control
Selection of which performance events to observe
How is selection supported in instrumentation system?
Could depend on scope, type, level of interest
Could depend on instrumentation overhead
No choice
Include / exclude lists (TAU)
Environment variables
Static vs. dynamic
Problem: Controlling instrumentation of small routines
High relative measurement overhead
Significant intrusion and possible perturbation
Using PDT: tau_instrumentor
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline]
[-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]
For selective instrumentation, use –f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
Rule-Based Overhead Analysis (N. Trebon, UO)
Analyze the performance data to determine events with
high (relative) overhead performance measurements
Create a select list for excluding those events
Rule grammar (used in TAUreduce tool)
[GroupName:] Field Operator Number
GroupName indicates rule applies to events in group
Field is an event metric attribute (from profile statistics)
numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
Operator is one of >, <, or =
Number is any number
Compound rules possible using & between simple rules
Example Rules
#Exclude all events that are members of TAU_USER
#and use less than 1000 microseconds
TAU_USER:usec < 1000
#Exclude all events that have less than 1000
#microseconds and are called only once
usec < 1000 & numcalls = 1
#Exclude all events that have less than 1000 usecs per
#call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
Scientific notation can be used
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
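Evaluating one simple rule of the form `Field Operator Number` amounts to the following C++ sketch (a toy illustration; the real TAUreduce tool implements the full grammar above, including group prefixes, and the `EventStats`, `matches`, and `tiny_singleton` names are invented):

```cpp
#include <string>

// Toy evaluator for one "Field Operator Number" rule (illustration only).
struct EventStats {
    double usec;    // exclusive microseconds
    long numcalls;  // number of calls
    double percent; // total inclusive percent
};

// True if the event matches the rule (and so should be excluded).
inline bool matches(const EventStats& e, const std::string& field,
                    char op, double number) {
    double value = 0.0;
    if (field == "usec")            value = e.usec;
    else if (field == "numcalls")   value = static_cast<double>(e.numcalls);
    else if (field == "percent")    value = e.percent;
    else if (field == "usecs/call") value = e.numcalls ? e.usec / e.numcalls : 0.0;
    switch (op) {
        case '<': return value < number;
        case '>': return value > number;
        case '=': return value == number;
        default:  return false;
    }
}

// A compound rule such as "usec < 1000 & numcalls = 1" is a conjunction:
inline bool tiny_singleton(const EventStats& e) {
    return matches(e, "usec", '<', 1000) && matches(e, "numcalls", '=', 1);
}
```

Events matching any rule end up on the exclude list fed back to the instrumentor through its -f option.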
TAU Instrumentation Options
Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT – Source-to-source translation
MPI - Wrapper interposition library
Opari – OpenMP directive rewriting
TAU’s MPI Wrapper Interposition Library
Uses standard MPI Profiling Interface
Provides name shifted interface
MPI_Send = PMPI_Send
Weak bindings
Interpose TAU’s MPI wrapper library between the application and MPI
-lmpi replaced by –lTauMpi –lpmpi –lmpi
MPI Library Instrumentation (MPI_Send)
int MPI_Send(…) /* TAU redefines MPI_Send */
{
int returnVal, typesize;
TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
TAU_PROFILE_START(tautimer);
if (dest != MPI_PROC_NULL) {
PMPI_Type_size(datatype, &typesize);
TAU_TRACE_SENDMSG(tag, dest, typesize*count);
}
/* Wrapper calls PMPI_Send */
returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
TAU_PROFILE_STOP(tautimer);
return returnVal;
}
Including TAU’s stub Makefile
include /usr/tau/sgi64/lib/Makefile.tau-mpi
CXX = $(TAU_CXX)
CC = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
LD_FLAGS = $(USER_OPT) $(TAU_LDFLAGS)
OBJS = ...
TARGET= a.out
TARGET: $(OBJS)
$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
.cpp.o:
$(CC) $(CFLAGS) -c $< -o $@
TAU Instrumentation Options
Manual instrumentation
TAU Profiling API
Automatic instrumentation approaches
PDT – Source-to-source translation
MPI - Wrapper interposition library
Opari – OpenMP directive rewriting [FZJ, Germany]
Instrumentation of OpenMP Constructs
OpenMP Pragma And Region Instrumentor
Source-to-Source translator to insert POMP calls
around OpenMP constructs and API functions
Done: Supports
Fortran77 and Fortran90, OpenMP 2.0
C and C++, OpenMP 1.0
POMP Extensions
EPILOG and TAU POMP implementations
Preserves source code information (#line line file)
Work in Progress:
Investigating standardization through OpenMP Forum
OpenMP API Instrumentation
Transform
omp_#_lock() → pomp_#_lock()
omp_#_nest_lock() → pomp_#_nest_lock()
[ # = init | destroy | set | unset | test ]
POMP version
Calls omp version internally
Can do extra stuff before and after call
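The name-shifted wrapper pattern can be sketched as follows, using an invented stand-in lock type rather than the real OpenMP runtime (illustration only; `toy_lock` and all function names are made up for the example):

```cpp
// Sketch of the name-shifted POMP wrapper pattern with a stand-in lock
// instead of the real OpenMP runtime (illustration only).
struct toy_lock { bool held = false; };

inline int& events_before() { static int n = 0; return n; }
inline int& events_after()  { static int n = 0; return n; }

// The "omp" version does the actual work.
inline void omp_toy_set_lock(toy_lock* l) { l->held = true; }

// The "pomp" version calls the omp version internally and can do
// extra measurement work before and after the call.
inline void pomp_toy_set_lock(toy_lock* l) {
    ++events_before();      // e.g. record a lock-acquire event
    omp_toy_set_lock(l);    // forward to the omp version
    ++events_after();       // e.g. record completion / timing
}
```

Opari's source-to-source rewrite simply redirects the application's lock calls to the pomp-prefixed wrappers.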
Example: !$OMP PARALLEL DO Instrumentation
call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
call pomp_parallel_begin(d)
call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
do loop
!$OMP END DO NOWAIT
call pomp_barrier_enter(d)
!$OMP BARRIER
call pomp_barrier_exit(d)
call pomp_do_exit(d)
call pomp_parallel_end(d)
!$OMP END PARALLEL
call pomp_parallel_join(d)
Opari Instrumentation: Example
OpenMP directive instrumentation
pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
for(j=j1;j<=j2;j++){
new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
+ a4*psi[i][j-1] - a5*the_for[i][j];
diff=diff+fabs(new_psi[i][j]-psi[i][j]);
}
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
OPARI: Basic Usage (f90)
Reset OPARI state information
Call OPARI for each input source file
rm -f opari.rc
opari file1.f90
...
opari fileN.f90
Generate OPARI runtime table, compile it with ANSI C
opari -table opari.tab.c
cc -c opari.tab.c
Compile modified files *.mod.f90 using OpenMP
Link the resulting object files, the OPARI runtime table
opari.tab.o and the TAU POMP RTL
OPARI: Makefile Template (C/C++)
OMPCC  = ...  # insert C OpenMP compiler here
OMPCXX = ...  # insert C++ OpenMP compiler here
.c.o:
opari $<
$(OMPCC) $(CFLAGS) -c $*.mod.c
.cc.o:
opari $<
$(OMPCXX) $(CXXFLAGS) -c $*.mod.cc
opari.init:
rm -rf opari.rc
opari.tab.o:
opari -table opari.tab.c
$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
$(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.c myheader.h
myfile2.o: ...
OPARI: Makefile Template (Fortran)
OMPF77 = ...  # insert f77 OpenMP compiler here
OMPF90 = ...  # insert f90 OpenMP compiler here
.f.o:
opari $<
$(OMPF77) $(FFLAGS) -c $*.mod.F
.f90.o:
opari $<
$(OMPF90) $(F90FLAGS) -c $*.mod.F90
opari.init:
rm -rf opari.rc
opari.tab.o:
opari -table opari.tab.c
$(CC) -c opari.tab.c
myprog: opari.init myfile*.o ... opari.tab.o
$(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp
myfile1.o: myfile1.f90
myfile2.o: ...
TAU Measurement
Performance information
High-resolution timer library (real-time / virtual clocks)
General software counter library (user-defined events)
Hardware performance counters
PAPI (Performance API) (UTK, Ptools Consortium)
consistent, portable API
Organization
Node, context, thread levels
Profile groups for collective events (runtime selective)
Performance data mapping between software levels
TAU Measurement (continued)
Parallel profiling
Tracing
Function-level, block-level, statement-level
Supports user-defined events
TAU parallel profile database
Callpath profiles
Hardware counter values
All profile-level events
Inter-process communication events
Timestamp synchronization
User-configurable measurement library (user controlled)
TAU Measurement System Configuration
configure [OPTIONS]
{-c++=<CC>, -cc=<cc>} Specify C++ and C compilers
{-pthread, -sproc}
Use pthread or SGI sproc threads
-openmp
Use OpenMP threads
-opari=<dir>
Specify location of Opari OpenMP tool
-papi=<dir>
Specify location of PAPI
-pdt=<dir>
Specify location of PDT
{-mpiinc=<d>, -mpilib=<d>} Specify MPI library instrumentation
-TRACE
Generate TAU event traces
-PROFILE
Generate TAU profiles
-PROFILECALLPATH
Generate Callpath profiles (1-level)
-MULTIPLECOUNTERS
Use more than one hardware counter
-CPUTIME
Use user time + system time
-PAPIWALLCLOCK
Use PAPI to access wallclock time
-PAPIVIRTUAL
Use PAPI for virtual (user) time …
TAU Measurement Configuration – Examples
./configure -c++=xlC -cc=xlc –pdt=/usr/packages/pdtoolkit-2.1 -pthread
Use TAU with IBM’s xlC compiler, PDT and the pthread library
Enable TAU profiling (default)
./configure -TRACE –PROFILE
Enable both TAU profiling and tracing
./configure -c++=CC -cc=cc –MULTIPLECOUNTERS -papi=/usr/local/packages/papi –opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib –SGITIMERS -PAPIVIRTUAL
Use OpenMP+MPI using SGI’s compiler suite, Opari and use PAPI for accessing hardware performance counters & virtual time for measurements
Typically configure multiple measurement libraries
Typically configure multiple measurement libraries
Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01   (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiplecounters):
% setenv COUNTER1 PAPI_FP_INS
(PAPI’s Floating point ins)
% setenv COUNTER2 PAPI_L1_DCM
(PAPI’s L1 Data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME
(PAPI’s virtual time)
% setenv COUNTER4 SGI_TIMERS
(Wallclock time)
% mpirun –np <n> <application>
% llsubmit job.sh
Performance Mapping
Associate performance with “significant” entities (events)
Source code points are important
Functions, regions, control flow events, user events
Execution process and thread entities are important
Some entities are more abstract, harder to measure
Consider callgraph (callpath) profiling
Measure time (metric) along an edge (path) of callgraph
Incident edge gives parent / child view
Edge sequence (path) gives parent / descendant view
Problem: Callpath profiling when callgraph is unknown
Determine callgraph dynamically at runtime
Map performance measurement to dynamic call path state
1-Level Callpath Implementation in TAU
TAU maintains a performance event (routine) callstack
Profiled routine (child) looks in callstack for parent
Previous profiled performance event is the parent
A callpath profile structure created first time parent calls
TAU records parent in a callgraph map for child
String representing 1-level callpath used as its key
“a( )=>b( )” : name for time spent in “b” when called by “a”
Map returns pointer to callpath profile structure
1-level callpath is profiled using this profiling data
Build upon TAU’s performance mapping technology
Measurement is independent of instrumentation
Use –PROFILECALLPATH to configure TAU
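The 1-level callpath bookkeeping described above can be sketched in C++ (a toy in the spirit of the slide, not TAU's implementation; all names are invented):

```cpp
#include <map>
#include <string>
#include <vector>

// Toy 1-level callpath profiling (illustration only).
struct CallpathProfile { long calls = 0; };

inline std::vector<std::string>& event_callstack() {
    static std::vector<std::string> stack;
    return stack;
}

inline std::map<std::string, CallpathProfile>& callpath_table() {
    static std::map<std::string, CallpathProfile> table;
    return table;
}

// On entry: the previous event on the stack is the parent; build the
// "parent=>child" key and update that callpath's profile structure.
inline void enter_event(const std::string& name) {
    if (!event_callstack().empty()) {
        std::string key = event_callstack().back() + "=>" + name;
        callpath_table()[key].calls += 1;
    }
    event_callstack().push_back(name);
}

inline void exit_event() { event_callstack().pop_back(); }
```

The first time a given parent calls a child, the map lookup creates the callpath profile structure; later calls find and update it.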
TAU Analysis
Profile analysis
pprof – parallel profiler with text-based display
racy – graphical interface to pprof (Tcl/Tk)
jracy – Java implementation of Racy
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, Vampir)
Vampir (Pallas) trace visualization
Paraver (CEPBA) trace visualization
Pprof Command
pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes]
-c
Sort according to number of calls
-b
Sort according to number of subroutines called
-m
Sort according to msecs (exclusive time total)
-t
Sort according to total msecs (inclusive time total)
-e
Sort according to exclusive time per call
-i
Sort according to inclusive time per call
-v
Sort according to standard deviation (exclusive usec)
-r
Reverse sorting order
-s
Print only summary profile information
-n num Print only first number of functions
-f file
Specify full path and filename without node ids
-l
List all functions and exit
TAU Parallel Performance Profiles
Terminology – Example
For routine “int main( )”:
int main( )
{ /* takes 100 secs */
f1(); /* takes 20 secs */
f2(); /* takes 50 secs */
f1(); /* takes 20 secs */
/* other work */
}
Inclusive time: 100 secs
Exclusive time: 100-20-50-20 = 10 secs
Calls: 1 call
Subrs (no. of child routines called): 3
Inclusive time/call: 100 secs
Time can be replaced by counts
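The exclusive-time arithmetic in this example amounts to the following sketch (the function name is invented for illustration):

```cpp
#include <numeric>
#include <vector>

// Exclusive time as used above: inclusive time of a routine minus the
// inclusive times of the child routines it calls.
inline double exclusive_time(double inclusive,
                             const std::vector<double>& child_inclusive) {
    return inclusive - std::accumulate(child_inclusive.begin(),
                                       child_inclusive.end(), 0.0);
}
```

For the main( ) above: 100 - (20 + 50 + 20) = 10 seconds of exclusive time; the same formula applies when the metric is a hardware count rather than time.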
jracy (NAS Parallel Benchmark – LU)
Global profiles
n: node
c: context
t: thread
Individual profile
Routine profile across all nodes
jracy (Callpath Profiles) (R. A. Bell, UO)
Callpath profile across all nodes
Vampir Trace Visualization Tool
Visualization and analysis of MPI programs
Originally developed by Forschungszentrum Jülich
Current development by Technical University Dresden
Distributed by PALLAS, Germany
http://www.pallas.de/pages/vampir.htm
Using TAU with Vampir
Configure TAU with -TRACE option
% configure –TRACE –SGITIMERS …
Execute application
% mpirun –np 4 a.out
This generates TAU traces and event descriptors
Merge all traces using tau_merge
% tau_merge *.trc app.trc
Convert traces to Vampir Trace format using tau_convert
% tau_convert –pv app.trc tau.edf app.pv
Note: Use –vampir instead of –pv for multi-threaded traces
Load generated trace file in Vampir
% vampir app.pv
Vampir: Main Window
Provides main menu
Access to global and process local displays
Preferences
Help
Trace file loading can be
Interrupted at any time
Resumed
Started at a specified time offset
Trace file can be re-written (re-grouped symbols)
Vampir: Timeline Diagram
Functions organized into groups
Coloring by group
Message lines can be colored by tag or size
Information about states, messages, collective, and I/O
operations available by clicking on the representation
Vampir: Timeline Diagram (Message Info)
Source–code references are displayed if recorded in trace
Vampir: Execution Statistics Displays
Aggregated profiling information: execution time, # calls, inclusive/exclusive
Available for all/any group (activity)
Available for all routines (symbols)
Available for any trace part (select in timeline diagram)
Vampir: Communication Statistics Displays
Bytes sent/received for collective operations
Message length statistics
Available for any trace part
Byte and message count, min/max/avg message length and min/max/avg bandwidth for each process pair
Vampir: Other Features
Parallelism display
Powerful filtering and trace comparison features
All diagrams highly customizable (through context menus)
Dynamic global call graph tree
Vampir: Process Displays
Activity chart
Call tree
Timeline
For all selected processes in the global displays
Vampir (NAS Parallel Benchmark – LU)
Timeline display
Callgraph display
Parallelism display
Communications display
TAU Performance System Status
Computing platforms
IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), Hitachi, NEC
Programming languages
C, C++, Fortran 77/90, HPF, Java
Communication libraries
MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
Thread libraries
pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
Compilers
KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
PDT Status
Program Database Toolkit (Version 2.1, web download)
EDG C++ front end (Version 2.45.2)
Mutek Fortran 90 front end (Version 2.4.1)
C++ and Fortran 90 IL Analyzer
DUCTAPE library
Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
TAU instrumentor (C/C++/F90)
Program analysis support for SILOON and CHASM
Platforms
SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64),
Apple, Windows, Cray T3E, Hitachi
Work in Progress
Visualization: TAU will generate event traces with PAPI performance data; Vampir (v3.0) will support visualization of this data
Performance Monitoring and Steering
Performance Database Framework
Vampir v3.x: HPM Counter
Counter Timeline Display
Process Timeline Display
Performance Monitoring and Steering
Desirable to monitor performance during execution
Large-scale parallel applications complicate solutions
Long-running applications
Steering computations for improved performance
More parallel threads of execution producing data
Large amount of performance data (relative) to access
Analysis and visualization more difficult
Problem: Online performance data access and analysis
Incremental profile sampling (based on files)
Integration in computational steering system
Dynamic performance measurement and access
Online Performance Analysis (K. Li, UO)
[Diagram: the application, built on SCIRun (Univ. of Utah), runs with the TAU performance system producing parallel performance data streams; performance data output accumulates as samples in the file system; a Performance Data Integrator (sample sequencing, reader synchronization) feeds the Performance Data Reader, Performance Analyzer, and Performance Visualizer, enabling performance steering of the application.]
2D Field Performance Visualization in SCIRun
SCIRun program
Uintah Computational Framework (UCF)
University of Utah
UCF analysis
Scheduling
MPI library
Components
500 processes
Use for online and offline visualization
Apply SCIRun steering
Empirical-Based Performance Optimization
[Diagram: a cyclic process – Performance Tuning (guided by experiment schemas) produces hypotheses for Performance Diagnosis; diagnosed properties define experiment trials for Performance Experimentation; its characterization sets the observability requirements for Performance Observation, which feeds back into tuning.]
TAU Performance Database Framework
[Diagram: raw performance data, with performance data descriptions, passes through PerfDML translators into PerfDB – an ORDB (PostgreSQL) holding profile data only, in an XML representation organized as project / experiment / trial; a performance analysis and query toolkit serves performance analysis programs.]
PerfDBF Architecture (L. Li, R. Bell, UO)
App. profiled with TAU → standard TAU output data format → TAU-to-XML converter → TAU XML → database loader → SQL database → analysis tool
Scalability Analysis Process
Scalability study on LU
% suite.def # of procs -> 1, 2, 4, and 8
% mpirun -np 1 lu.W1
% mpirun -np 2 lu.W2
% mpirun -np 4 lu.W4
% mpirun -np 8 lu.W8
populateDatabase.sh
run Java translator to translate profiles into XML
run Java XML reader to write XML profiles to database
Read times for routines and program from experiments
Calculate scalability metrics
Contents of Performance Database
Scalability Analysis Results
Scalability of LU performance experiments
Four trial runs
funname | processors | meanspeedup
applu   | 2 | 2.0896117809566
applu   | 4 | 4.812100975788783
applu   | 8 | 8.168409581149514
…
exact   | 2 | 1.95853126762839071803
exact   | 4 | 4.03622321124616535446
exact   | 8 | 7.193812137750623668346
Current Status and Future
PerfDBF prototype
TAU profile to XML translator
XML to PerfDB populator
PostgreSQL database
Java-based PostgreSQL query module
Use as a layer to support performance analysis tools
Make accessing the performance database quicker
Continue development
XML parallel profile representation
Basic specification
Overview
Introduction
Tuning and Analysis Utilities (TAU)
Instrumentation
Measurement
Analysis
Work in progress:
Definitions, general problem
Visualization: Vampir
Performance Monitoring and Steering
Performance Database Framework
Case Study: Uintah
Conclusions
Case Study: Utah ASCI/ASAP Level 1 Center
C-SAFE was established to build a problem-solving
environment (PSE) for the numerical simulation of
accidental fires and explosions
Fundamental chemistry and engineering physics models
Coupled with non-linear solvers, optimization,
computational steering, visualization, and experimental
data verification
Very large-scale simulations
Computer science problems:
Coupling of multiple simulation codes
Software engineering across diverse expert teams
Achieving high performance on large-scale systems
Example C-SAFE Simulation Problems
Heptane fire simulation
Material stress simulation
Typical C-SAFE simulation with
a billion degrees of freedom and
non-linear time dynamics
Uintah High-Level Component View
Uintah Computational Framework
Execution model based on software (macro) dataflow
Exposes parallelism and hides data transport latency
Computations expressed as directed acyclic graphs of tasks
each task consumes input and produces output (input to a future task)
input/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory
DataWarehouse
Directory mapping names to values (array structured)
Write value once then communicate to awaiting tasks
Task graph gets mapped to processing resources
Communications schedule approximates global optimal
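The write-once discipline of the DataWarehouse can be illustrated with a toy single-assignment store (illustration only; `ToyDataWarehouse` is an invented name and the real UCF interface is far richer, e.g. array-structured values per patch):

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Toy single-assignment store in the spirit of the DataWarehouse:
// a directory mapping names to values, written exactly once.
class ToyDataWarehouse {
    std::map<std::string, double> values_;
public:
    // Each name may be written exactly once.
    void put(const std::string& name, double value) {
        if (!values_.insert(std::make_pair(name, value)).second)
            throw std::runtime_error("single-assignment violation: " + name);
    }
    // Awaiting tasks read the value after it has been produced.
    double get(const std::string& name) const { return values_.at(name); }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
};
```

Single assignment is what lets the scheduler communicate a value to awaiting tasks as soon as it is written, without worrying about later mutation.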
Uintah Task Graph (Material Point Method)
Diagram of named tasks (ovals) and data (edges)
Imminent computation
Dataflow-constrained
MPM – Newtonian material point motion time step
Solid: values defined at material point (particle)
Dashed: values defined at vertex (grid)
Prime (‘): values updated during time step
Uintah PSE
UCF automatically sets up:
Domain decomposition
Inter-processor communication with aggregation/reduction
Parallel I/O
Checkpoint and restart
Performance measurement and analysis (stay tuned)
Software engineering
Coding standards
CVS (Commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
Correctness regression testing with bugzilla bug tracking
Nightly build (parallel compiles)
170,000 lines of code (Fortran and C++ tasks supported)
Performance Technology Integration
Uintah presents challenges to performance integration
Software diversity and structure
UCF middleware, simulation code modules
component-based hierarchy
Portability objectives
cross-language and cross-platform
multi-parallelism: thread, message passing, mixed
Scalability objectives
High-level programming and execution abstractions
Requires flexible and robust performance technology
Requires support for performance mapping
Task Execution in Uintah Parallel Scheduler
Profile methods and functions in scheduler and in MPI library
Task execution time dominates (what task?)
MPI communication overheads (where?)
Need to map performance data!
Task execution time distribution
Semantics-Based Performance Mapping
Associate performance measurements with high-level semantic abstractions
Need mapping support in the performance measurement system to assign data correctly
Semantic Entities/Attributes/Associations (SEAA)
New dynamic mapping scheme
Entities defined at any level of abstraction
Attribute entity with semantic information
Entity-to-entity associations
Two association types (implemented in TAU API)
Embedded – extends data structure of associated object to store performance measurement entity
External – creates an external look-up table using address of object as the key to locate performance measurement entity
Uintah Task Performance Mapping
Uintah partitions individual particles across processing elements (processes or threads)
Simulation tasks in task graph work on particles
Tasks have domain-specific character in the computation
“interpolate particles to grid” in Material Point Method
Task instances generated for each partitioned particle set
Execution scheduled with respect to task dependencies
How to attribute execution time among different tasks?
Assign semantic name (task type) to a task instance
SerialMPM::interpolateParticleToGrid
Map TAU timer object to (abstract) task (semantic entity)
Look up timer object using task type (semantic attribute)
Further partition along different domain-specific axes
Using External Associations
Two level mappings:
Level 1: <task name, timer>
Level 2: <task name, patch, timer>
Diagram: embedded association stores performance data within the data object, vs. external association, which locates it via a hash table keyed by object address
Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw,
                           DataWarehouseP& dw) {
  ...
  // Create a semantic mapping entity keyed on the task type name
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);  // task execution attributed to the mapped timer
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
Task Performance Mapping (Trace)
Work packet computation events colored by task type
Distinct phases of computation can be identified based on task
Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
Comparing Uintah Traces for Scalability Analysis
8 processes
32 processes
Scaling Performance Optimizations
Last year: initial “correct” scheduler
Reduce communication by 10x
ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory
Reduce task graph overhead by 20x
Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory
Concluding Remarks
Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools
To build more sophisticated performance tools, existing proven performance technology must be utilized
Performance tools must be integrated with software and systems models and technology
Performance-engineered software must function consistently and coherently in software and system environments
PAPI and TAU performance systems offer robust performance technology that can be broadly integrated
Information
TAU (http://www.acl.lanl.gov/tau)
PDT (http://www.acl.lanl.gov/pdtoolkit)
PAPI (http://icl.cs.utk.edu/projects/papi/)
OPARI (http://www.fz-juelich.de/zam/kojak/)
Support Acknowledgement
TAU and PDT support:
Department of Energy (DOE)
DOE 2000 ACTS contract
DOE MICS contract
DOE ASCI Level 3 (LANL, LLNL)
U. of Utah DOE ASCI Level 1 subcontract
DARPA
NSF National Young Investigator (NYI) award