The TAU Performance System
Allen D. Malony
Sameer S. Shende
Robert Bell
{malony,sameer,bell}@cs.uoregon.edu
Department of Computer and Information Science
Computational Science Institute
University of Oregon
Overview
Motivation and goals
TAU architecture and toolkit
Performance mapping
Application case studies
Instrumentation
Measurement
Analysis
…
TAU Integration
Work in progress
Conclusions
The TAU Performance System
2
SC2002 PERC Tutorial, Nov. 17, 2002
Motivation
Tools for performance problem solving
Empirical-based performance optimization process
(Diagram: a cycle of Performance Tuning, Performance Diagnosis, Performance Experimentation, and Performance Observation, connected by hypotheses, properties, and characterization. Performance Technology supplies instrumentation, measurement, analysis, and visualization.)
Versatile performance technology
Portable performance analysis methods
Problems
Diverse performance observability requirements
Multiple levels of software and hardware
Different types and detail of performance data
Alternative performance problem solving methods
Multiple targets of software and system application
Demands more robust performance technology
Broad scope of performance observation
Flexible and configurable mechanisms
Technology integration and extension
Cross-platform portability
Open, layered, and modular framework architecture
Complexity Challenges for Performance Tools
Computing system environment complexity
Observation integration and optimization
Access, accuracy, and granularity constraints
Diverse/specialized observation capabilities/technology
Restricted modes limit performance problem solving
Sophisticated software development environments
Programming paradigms and performance models
Performance data mapping to software abstractions
Uniformity of performance abstraction across platforms
Rich observation capabilities and flexible configuration
Common performance problem solving methods
General Problems (Performance Technology)
How do we create robust and ubiquitous
performance technology for the analysis and tuning
of parallel and distributed software and systems in
the presence of (evolving) complexity challenges?
How do we apply performance technology effectively
for the variety and diversity of performance
problems that arise in the context of complex
parallel and distributed computer systems?
Computation Model for Performance Technology
How to address dual performance technology goals?
Robust capabilities + widely available methods
Contend with problems of system diversity
Flexible tool composition/configuration/integration
Approaches
Restrict computation types / performance problems
machines, languages, instrumentation technique, …
limited performance technology coverage and application
Base technology on abstract computation model
general architecture and software execution features
map features/methods to existing complex system types
develop capabilities that can be adapted and optimized
General Complex System Computation Model
Node: physically distinct shared memory machine
Message passing node interconnection network
Context: distinct virtual memory space within node
Thread: execution threads (user/system) in context
(Figure: the physical view shows SMP nodes, each with node memory, joined by an interconnection network that carries inter-node message communication. The model view shows each node holding contexts (VM spaces) that contain threads.)
TAU Performance System
Tuning and Analysis Utilities
Performance system framework for scalable parallel and
distributed high-performance computing
Targets a general complex system computation model
nodes / contexts / threads
Multi-level: system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance instrumentation,
measurement, analysis, and visualization
Portable performance profiling and tracing facility
Open software approach with technology integration
University of Oregon, Forschungszentrum Jülich, LANL
Definitions – Profiling
Profiling
Recording of summary information during execution
execution time, # calls, hardware statistics, …
Reflects performance behavior of program entities
functions, loops, basic blocks
user-defined “semantic” entities
Very good for low-cost performance assessment
Helps to expose performance bottlenecks and hotspots
Implemented through
sampling: periodic OS interrupts or hardware counter traps
instrumentation: direct insertion of measurement code
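The instrumentation route can be illustrated with a small sketch. This is not TAU code: `Profiler`, `make_fake_clock`, and `work` are hypothetical names, and the fake clock only exists so that the numbers are repeatable. The point is that a profile keeps summary data (call count, accumulated time) per routine instead of a record per event.

```python
import time
from collections import defaultdict


class Profiler:
    """Keeps summary statistics per routine rather than a record per event."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                    # injectable so the demo is deterministic
        self.calls = defaultdict(int)         # routine name -> number of calls
        self.inclusive = defaultdict(float)   # routine name -> inclusive time

    def instrument(self, func):
        """Insert measurement code around func (instrumentation, not sampling)."""
        name = func.__name__

        def wrapper(*args, **kwargs):
            start = self.clock()
            try:
                return func(*args, **kwargs)
            finally:
                self.calls[name] += 1
                self.inclusive[name] += self.clock() - start

        return wrapper


def make_fake_clock(step=1.0):
    """Hypothetical test clock: every reading advances time by `step`."""
    now = 0.0

    def clock():
        nonlocal now
        now += step
        return now

    return clock


profiler = Profiler(clock=make_fake_clock())


@profiler.instrument
def work(n):
    return sum(range(n))


results = [work(10) for _ in range(3)]
```

With the default `time.perf_counter` clock the same wrapper would report real elapsed seconds; a sampling profiler would instead inspect the running routine at periodic interrupts.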
Definitions – Tracing
Tracing
Recording of information about significant points (events)
during program execution
entering/exiting code region (function, loop, block, …)
thread/process interactions (e.g., send/receive message)
Save information in event record
timestamp
CPU identifier, thread identifier
Event type and event-specific information
Event trace is a time-sequenced stream of event records
Can be used to reconstruct dynamic program behavior
Typically requires code instrumentation
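A minimal sketch of the idea, assuming a hypothetical `EventRecord` layout (TAU's actual trace format is not shown on the slide): each record carries a timestamp, a thread identifier, an event type, and event-specific data, and replaying the time-ordered stream recovers how long each region was active.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    timestamp: float
    thread: int
    kind: str      # "enter" or "exit"
    region: str    # event-specific information: the code region name


def region_durations(trace):
    """Replay a trace in time order and return (thread, region, duration) tuples."""
    stacks = {}       # thread id -> stack of (region, enter timestamp)
    durations = []
    for ev in sorted(trace, key=lambda e: e.timestamp):
        stack = stacks.setdefault(ev.thread, [])
        if ev.kind == "enter":
            stack.append((ev.region, ev.timestamp))
        elif ev.kind == "exit":
            region, entered = stack.pop()
            if region != ev.region:
                raise ValueError(f"unbalanced exit for {ev.region!r}")
            durations.append((ev.thread, region, ev.timestamp - entered))
    return durations


trace = [
    EventRecord(0.0, 0, "enter", "main"),
    EventRecord(1.0, 0, "enter", "solve"),
    EventRecord(4.0, 0, "exit", "solve"),
    EventRecord(5.0, 0, "exit", "main"),
]
durations = region_durations(trace)
```

Unlike a profile, the trace keeps every event, so the same records could also be replayed to draw a timeline.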
TAU Performance System Architecture
(Architecture diagram; surviving labels: Paraver, EPILOG)
TAU Performance System Goals
Multi-level performance instrumentation
Flexible and configurable performance measurement
Widely-ported parallel performance profiling system
Computer system architectures and operating systems
Different programming languages and compilers
Support for multiple parallel programming paradigms
Multi-threading, message passing, mixed-mode, hybrid
Multi-language automatic source instrumentation
Support for performance mapping
Support for object-oriented and generic programming
Integration in complex software systems and applications
How To Use TAU?
Instrumentation
Application code and libraries
Selective instrumentation
Install, compile, and link with TAU measurement library
% configure; make clean install
Multiple configurations for different measurement options
Does not require change in instrumentation
Selective measurement control
Execute “experiments” to produce performance data
Performance data generated at end or during execution
Use analysis tools to look at performance results
TAU Instrumentation Approach
Support for standard program events
Routines
Classes and templates
Statement-level blocks
Support for user-defined events
Begin/End events (“user-defined timers”)
Atomic events
Selection of event statistics
Support definition of “semantic” entities for mapping
Support for event groups
Instrumentation optimization
TAU Instrumentation
Flexible instrumentation mechanisms at multiple levels
Source code
manual
automatic
C, C++, F77/90 (Program Database Toolkit (PDT))
OpenMP (directive rewriting (Opari))
Object code
pre-instrumented libraries (e.g., MPI using PMPI)
statically-linked and dynamically-linked
fast breakpoints (compiler generated)
Executable code
dynamic instrumentation (pre-execution) (DyninstAPI)
virtual machine instrumentation (e.g., Java using JVMPI)
Multi-Level Instrumentation
Targets common measurement interface
TAU API
Multiple instrumentation interfaces
Simultaneously active
Information sharing between interfaces
Utilizes instrumentation knowledge between levels
Selective instrumentation
Available at each level
Cross-level selection
Targets a common performance model
Presents a unified view of execution
Consistent performance events
Program Database Toolkit (PDT)
Program code analysis framework
High-level interface to source code information
Integrated toolkit for source code parsing, database
creation, and database query
develop source-based tools
Commercial grade front-end parsers
Portable IL analyzer, database format, and access API
Open software approach for tool development
Multiple source languages
Implement automatic performance instrumentation tools
tau_instrumentor
PDT Architecture and Tools
(Figure: application / library source is parsed by the C / C++ parser or the Fortran 77/90 parser into IL. The matching C / C++ or Fortran 77/90 IL analyzer writes program database files. DUCTAPE gives access to those files for PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90 interoperability), and TAU_instr (automatic source instrumentation).)
PDT Components
Language front end
Edison Design Group (EDG): C, C++, Java
Mutek Solutions Ltd.: F77, F90
IL Analyzer
Processes intermediate language (IL) tree from front-end
Creates “program database” (PDB) formatted file
DUCTAPE (Bernd Mohr, FZJ/ZAM, Germany)
C++ program Database Utilities and Conversion Tools APplication Environment
Processes and merges PDB files
C++ library to access the PDB for PDT applications
Instrumentation Control
Selection of which performance events to observe
How is selection supported in instrumentation system?
Could depend on scope, type, level of interest
Could depend on instrumentation overhead
No choice
Include / exclude lists (TAU)
Environment variables
Static vs. dynamic
Controlling the instrumentation of small routines
High relative measurement overhead
Significant intrusion and possible perturbation
Selective Instrumentation
% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline]
[-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]
For selective instrumentation, use -f option
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST
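The file format above can be read with a few lines of code. The sketch below is an illustration only: `parse_select_file` and `should_instrument` are hypothetical helpers, not part of tau_instrumentor, and the matching is a plain string comparison on the full signature. It follows the two rules stated in the file's comments: lines starting with # are comments, and a non-empty include list makes its routines the only ones instrumented.

```python
def parse_select_file(text):
    """Return (exclude, include) lists of routine signatures."""
    lists = {"EXCLUDE": [], "INCLUDE": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                                   # blank line or comment
        if line.startswith("BEGIN_") and line.endswith("_LIST"):
            current = line[len("BEGIN_"):-len("_LIST")]
            if current not in lists:
                raise ValueError(f"unknown list: {line}")
        elif line.startswith("END_") and line.endswith("_LIST"):
            current = None
        elif current is not None:
            lists[current].append(line)
    return lists["EXCLUDE"], lists["INCLUDE"]


def should_instrument(signature, exclude, include):
    """An include list, when present, wins; otherwise skip excluded routines."""
    if include:
        return signature in include
    return signature not in exclude


SELECT_FILE = """\
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#END_INCLUDE_LIST
"""
exclude, include = parse_select_file(SELECT_FILE)
```

Because the include list is still commented out, only the three excluded routines are skipped and everything else stays instrumented.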
Overhead Analysis for Automatic Selection
Analyze the performance data to determine events with
high (relative) overhead performance measurements
Create a select list for excluding those events
Rule grammar (used in tau_reduce tool)
[GroupName:] Field Operator Number
GroupName indicates rule applies to events in group
Field is an event metric attribute (from profile statistics)
numcalls, numsubs, percent, usec, cumusec, count, totalcount, stdev, usecs/call, counts/call
Operator is one of >, <, or =
Number is any number
Compound rules possible using “&” between simple rules
Example Rules
#Exclude all events that are members of TAU_USER
#and use less than 1000 microseconds
TAU_USER:usec < 1000
#Exclude all events that have less than 1000
#microseconds and are called only once
usec < 1000 & numcalls = 1
#Exclude all events that have less than 1000 usecs per
#call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5
Scientific notation can be used
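To make the rule semantics concrete, here is a small evaluator for the grammar `[GroupName:] Field Operator Number`. It is a sketch under stated assumptions, not the tau_reduce implementation: the event dictionaries, the `groups` key, and the function names are all hypothetical. Rules joined by "&" must all hold, and separate lines act as alternatives (OR), as in the examples above.

```python
import operator

OPERATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def parse_simple(rule):
    """Parse '[GroupName:] Field Operator Number' into a 4-tuple."""
    group = None
    if ":" in rule:
        group, rule = (part.strip() for part in rule.split(":", 1))
    for symbol in OPERATORS:
        if symbol in rule:
            field, number = (part.strip() for part in rule.split(symbol, 1))
            return group, field, symbol, float(number)   # float() accepts 1e3
    raise ValueError(f"no operator in rule: {rule!r}")


def matches(line, event):
    """True if every '&'-joined simple rule on the line holds for the event."""
    for simple in line.split("&"):
        group, field, symbol, number = parse_simple(simple)
        if group is not None and group not in event["groups"]:
            return False
        if not OPERATORS[symbol](event[field], number):
            return False
    return True


def excluded(rule_text, event):
    """An event is excluded if any non-comment rule line matches it."""
    lines = [line.strip() for line in rule_text.splitlines()]
    return any(matches(l, event) for l in lines if l and not l.startswith("#"))


RULES = """
TAU_USER:usec < 1000
usec < 1000 & numcalls = 1
usecs/call < 1000
percent < 5
"""
small_user = {"groups": {"TAU_USER"}, "usec": 500.0, "numcalls": 10,
              "usecs/call": 50.0, "percent": 1.0}
big_user = {"groups": {"TAU_USER"}, "usec": 5e6, "numcalls": 10,
            "usecs/call": 5e5, "percent": 40.0}
once_other = {"groups": {"TAU_DEFAULT"}, "usec": 200.0, "numcalls": 1,
              "usecs/call": 200.0, "percent": 50.0}
```

Here `small_user` is caught by the first rule, `once_other` by the compound rule, and `big_user` by none.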
TAU Measurement
Performance information
Performance events
High-resolution timer library (real-time / virtual clocks)
General software counter library (user-defined events)
Hardware performance counters
PCL (Performance Counter Library) (ZAM, Germany)
PAPI (Performance API) (UTK, Ptools Consortium)
consistent, portable API
Organization
Node, context, thread levels
Profile groups for collective events (runtime selective)
Performance data mapping between software levels
TAU Measurement Options
Parallel profiling
Function-level, block-level, statement-level
Supports user-defined events
TAU parallel profile data stored during execution
Hardware counter values
Support for multiple counters
Support for callpath profiling
Tracing
All profile-level events
Inter-process communication events
Timestamp synchronization
Trace merging and format conversion
TAU Measurement System Configuration
configure [OPTIONS]
{-c++=<CC>, -cc=<cc>}          Specify C++ and C compilers
{-pthread, -sproc, -smarts}    Use pthread, SGI sproc, smarts threads
-openmp                        Use OpenMP threads
-opari=<dir>                   Specify location of Opari OpenMP tool
{-papi=<dir>, -pcl=<dir>}      Specify location of PAPI or PCL
-pdt=<dir>                     Specify location of PDT
{-mpiinc=<d>, -mpilib=<d>}     Specify MPI library instrumentation
-TRACE                         Generate TAU event traces
-PROFILE                       Generate TAU profiles
-PROFILECALLPATH               Generate Callpath profiles (1-level)
-MULTIPLECOUNTERS              Use more than one hardware counter
-CPUTIME                       Use usertime+system time
-PAPIWALLCLOCK                 Use PAPI to access wallclock time
-PAPIVIRTUAL                   Use PAPI for virtual (user) time …
TAU Measurement API
Initialization and runtime configuration
TAU_PROFILE_INIT(argc, argv);
TAU_PROFILE_SET_NODE(myNode);
TAU_PROFILE_SET_CONTEXT(myContext);
TAU_PROFILE_EXIT(message);
TAU_REGISTER_THREAD();
Function and class methods
TAU_PROFILE(name, type, group);
Template
TAU_TYPE_STRING(variable, type);
TAU_PROFILE(name, type, group);
CT(variable);
User-defined timing
TAU_PROFILE_TIMER(timer, name, type, group);
TAU_PROFILE_START(timer);
TAU_PROFILE_STOP(timer);
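The start/stop pattern of user-defined timers can be modeled in a few lines. This is a conceptual sketch, not the TAU macros: `TimerRegistry` and the scripted clock readings are hypothetical. It shows how nested start/stop pairs yield both inclusive time (whole interval) and exclusive time (interval minus time spent in nested timers), the two quantities that appear in the profile listings later in the tutorial.

```python
import time


class TimerRegistry:
    """Named timers with start/stop, tracking inclusive and exclusive time."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.stack = []        # active timers: [name, start time, time in children]
        self.inclusive = {}
        self.exclusive = {}

    def start(self, name):
        self.stack.append([name, self.clock(), 0.0])

    def stop(self, name):
        top, started, child_time = self.stack.pop()
        if top != name:
            raise RuntimeError(f"timer {name!r} stopped while {top!r} is active")
        elapsed = self.clock() - started
        self.inclusive[name] = self.inclusive.get(name, 0.0) + elapsed
        self.exclusive[name] = self.exclusive.get(name, 0.0) + elapsed - child_time
        if self.stack:
            self.stack[-1][2] += elapsed   # charge this interval to the parent


# Scripted clock readings keep the example deterministic.
readings = iter([0.0, 2.0, 5.0, 10.0])
timers = TimerRegistry(clock=lambda: next(readings))

timers.start("outer")    # t = 0
timers.start("inner")    # t = 2
timers.stop("inner")     # t = 5
timers.stop("outer")     # t = 10
```

The outer timer is active for 10 time units, 3 of which belong to the inner timer, so its exclusive time is 7.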
TAU Measurement API (continued)
User-defined events
TAU_REGISTER_EVENT(variable, event_name);
TAU_EVENT(variable, value);
TAU_PROFILE_STMT(statement);
Mapping
TAU_MAPPING(statement, key);
TAU_MAPPING_OBJECT(funcIdVar);
TAU_MAPPING_LINK(funcIdVar, key);
TAU_MAPPING_PROFILE(funcIdVar);
TAU_MAPPING_PROFILE_TIMER(timer, funcIdVar);
TAU_MAPPING_PROFILE_START(timer);
TAU_MAPPING_PROFILE_STOP(timer);
Reporting
TAU_REPORT_STATISTICS();
TAU_REPORT_THREAD_STATISTICS();
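A user-defined (atomic) event is registered once and then triggered with a value each time something of interest happens, for example a message size. The sketch below models that register/trigger pattern; `AtomicEvent` is a hypothetical class, and the particular statistics kept (count, minimum, maximum, mean, standard deviation) are an assumption for illustration rather than a statement of what TAU stores.

```python
import math


class AtomicEvent:
    """Running statistics for a value-carrying event; no per-trigger storage."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self._sum = 0.0
        self._sum_sq = 0.0

    def trigger(self, value):
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self._sum += value
        self._sum_sq += value * value

    @property
    def mean(self):
        return self._sum / self.count if self.count else 0.0

    @property
    def stdev(self):
        """Population standard deviation from the running sums."""
        if not self.count:
            return 0.0
        variance = self._sum_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


message_size = AtomicEvent("Message size (hypothetical event name)")
for size in (2, 4, 4, 4, 5, 5, 7, 9):
    message_size.trigger(size)
```

Only a handful of running sums are updated per trigger, which keeps the measurement cost constant regardless of how often the event fires.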
Grouping Performance Data in TAU
Profile Groups
A group of related routines forms a profile group
Statically defined
TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …
Dynamically defined
group name based on string, such as “adlib” or “particles”
runtime lookup in a map to get unique group identifier
uses tau_instrumentor to instrument
Ability to change group names at runtime
Group-based instrumentation and measurement control
TAU Group Instrumentation Control API
Enabling Profile Groups
TAU_ENABLE_INSTRUMENTATION();
TAU_ENABLE_GROUP(TAU_GROUP);
TAU_ENABLE_GROUP_NAME("group name");
TAU_ENABLE_ALL_GROUPS();
Disabling Profile Groups
TAU_DISABLE_INSTRUMENTATION();
TAU_DISABLE_GROUP(TAU_GROUP);
TAU_DISABLE_GROUP_NAME("group name");
TAU_DISABLE_ALL_GROUPS();
Obtaining Profile Group Identifier
Runtime Switching of Profile Groups
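Group-based control can be pictured as a name-to-identifier map plus a set of enabled groups. The sketch below is one possible model, not TAU's implementation: `GroupRegistry` is hypothetical, and representing each group as one bit of a mask is an assumption made for the example. It also parses a `group1+group2+…` specification of the kind passed on the command line.

```python
class GroupRegistry:
    """Maps group names to unique identifiers and tracks which are enabled."""

    def __init__(self):
        self._ids = {}       # group name -> identifier (one bit per group)
        self._enabled = 0    # bit mask of enabled groups

    def group_id(self, name):
        """Runtime lookup; a new name gets the next free identifier."""
        if name not in self._ids:
            self._ids[name] = 1 << len(self._ids)
        return self._ids[name]

    def enable(self, name):
        self._enabled |= self.group_id(name)

    def disable(self, name):
        self._enabled &= ~self.group_id(name)

    def disable_all(self):
        self._enabled = 0

    def enable_from_option(self, spec):
        """Enable exactly the groups named in a 'a+b+c' specification."""
        self.disable_all()
        for name in spec.split("+"):
            self.enable(name.strip())

    def is_enabled(self, name):
        return bool(self._enabled & self.group_id(name))


registry = GroupRegistry()
registry.enable_from_option("particles+field+mesh+io")
registry.disable("mesh")     # runtime switching of a single group
```

A measurement call would check `is_enabled` for its group before doing any timing work, so disabled groups cost little at runtime.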
TAU Pre-execution Control
Dynamic groups defined at file scope
Group names and group associations runtime modifiable
Controlling groups at pre-execution time
--profile <group1+group2+…+groupN> option
% tau_instrumentor app.pdb app.cpp \
    -o app.i.cpp -g "particles"
% mpirun -np 4 application \
    --profile particles+field+mesh+io
Examples:
POOMA (LANL) uses static groups
VTF (Caltech) uses dynamic group in Python-based
execution instrumentation control
Configuring TAU Measurement Library
Profiling with wallclock time (on a quad PIII Linux machine)
% configure -mpiinc=/usr/local/packages/mpich/include \
    -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ \
    -useropt=-O2 -LINUXTIMERS
Tracing
% configure -mpiinc=/usr/local/packages/mpich/include \
    -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit \
    -useropt=-O2 -LINUXTIMERS
Profiling with PAPI
% configure -mpiinc=/usr/local/packages/mpich/include \
    -mpilib=/usr/local/packages/mpich/lib -pdt=/usr/pkg/pdtoolkit/ \
    -useropt=-O2 -papi=/usr/local/packages/papi
% setenv PAPI_EVENT PAPI_FP_INS
% setenv PAPI_EVENT PAPI_L1_DCM
Compiling with TAU Makefiles
Include TAU Stub Makefile (<arch>/lib) in the user’s Makefile
Variables:
TAU_CXX            Specify the C++ compiler used by TAU
TAU_CC, TAU_F90    Specify the C, F90 compilers
TAU_DEFS           Defines used by TAU. Add to CFLAGS
TAU_LDFLAGS        Linker options. Add to LDFLAGS
TAU_INCLUDE        Header files include path. Add to CFLAGS
TAU_LIBS           Statically linked TAU library. Add to LIBS
TAU_SHLIBS         Dynamically linked TAU library
TAU_MPI_LIBS       TAU’s MPI wrapper library for C/C++
TAU_MPI_FLIBS      TAU’s MPI wrapper library for F90
TAU_FORTRANLIBS    Must be linked in with C++ linker for F90
TAU_DISABLE        TAU’s dummy F90 stub library
TAU Analysis
Parallel profile analysis
Pprof
parallel profiler with text-based display
Racy
graphical interface to pprof (Tcl/Tk)
jRacy
Java implementation of Racy
Trace analysis and visualization
Trace merging and clock adjustment (if necessary)
Trace format conversion (ALOG, SDDF, VTF, Paraver)
Trace visualization using Vampir (Pallas)
Pprof Command
pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes]
-c         Sort according to number of calls
-b         Sort according to number of subroutines called
-m         Sort according to msecs (exclusive time total)
-t         Sort according to total msecs (inclusive time total)
-e         Sort according to exclusive time per call
-i         Sort according to inclusive time per call
-v         Sort according to standard deviation (exclusive usec)
-r         Reverse sorting order
-s         Print only summary profile information
-n num     Print only first number of functions
-f file    Specify full path and filename without node ids
-l         List all functions and exit
nodes      Print only info about all contexts/threads of given node numbers
Pprof Output (NAS Parallel Benchmark – LU)
(Screenshot: pprof text output for an F90 + MPICH run on an Intel quad PIII Xeon. The profile is organized by node, context, and thread; events cover both code and MPI.)
jRacy (NAS Parallel Benchmark – LU)
(Screenshot: jRacy windows showing global profiles, a routine profile across all nodes, an individual profile, and the event legend. Labels use n: node, c: context, t: thread.)
Paraprof Profile Browser
Paraprof Profile Browser Main Window
Paraprof Profile Browser Node Window
Paraprof Profile Browser (Derived Metrics)
Paraprof Profile Browser Routine Window
TAU + PAPI (NAS Parallel Benchmark – LU )
Floating point operations
Re-link to alternate library
Can use multiple counter support
TAU + Vampir (NAS Parallel Benchmark – LU)
Timeline display
Callgraph display
Parallelism display
Communications display
tau_reduce Example
tau_reduce implements overhead reduction in TAU
Consider klargest example
Un-instrumented testcase: i = 2324, N = 1000000
Find kth largest element in N elements
Compare two methods: quicksort, select_kth_largest
quicksort: (wall clock) = 0.188511 secs
select_kth_largest: (wall clock) = 0.149594 secs
Total: (PIII/1.2GHz time) = 0.340u 0.020s 0:00.37
Execute with all routines instrumented
Execute with rule-based selective instrumentation
usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
Simple sorting example on one processor
Before selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive   Inclusive        #Call       #Subrs  Inclusive  Name
            msec        msec                            usec/call
---------------------------------------------------------------------------------------
100.0         13       4,982            1            4    4982030  int main
 93.5      3,223       4,659  4.20241E+06  1.40268E+07          1  void quicksort
 62.9    0.00481       3,134            5            5     626839  int kth_largest_qs
 36.4        137       1,813           28       450057      64769  int select_kth_largest
 33.6        150       1,675       449978       449978          4  void sort_5elements
 28.8      1,435       1,435  1.02744E+07            0          0  void interchange
  0.4         20          20            1            0      20668  void setup
  0.0     0.0118      0.0118           49            0          0  int ceil
After selective instrumentation reduction
NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time  Exclusive   Inclusive        #Call       #Subrs  Inclusive  Name
            msec  total msec                            usec/call
---------------------------------------------------------------------------------------
100.0         14         383            1            4     383333  int main
 50.9        195         195            5            0      39017  int kth_largest_qs
 40.0        153         153           28           79       5478  int select_kth_largest
  5.4         20          20            1            0      20611  void setup
  0.0       0.02        0.02           49            0          0  int ceil
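As a worked check, the rule from the tau_reduce example (usec>1000 & numcalls>400000 & usecs/call<30 & percent>25) can be applied by hand to the "before" profile. The sketch below copies the relevant columns from that listing. Two interpretations are assumptions here: `usec` is taken to be exclusive time, and `usecs/call` and `percent` are taken from the inclusive columns.

```python
# (name, %time, exclusive msec, #calls, inclusive usec/call) from the "before" profile
BEFORE = [
    ("int main",               100.0, 13,      1,         4982030),
    ("void quicksort",          93.5, 3223,    4.20241e6, 1),
    ("int kth_largest_qs",      62.9, 0.00481, 5,         626839),
    ("int select_kth_largest",  36.4, 137,     28,        64769),
    ("void sort_5elements",     33.6, 150,     449978,    4),
    ("void interchange",        28.8, 1435,    1.02744e7, 0),
    ("void setup",               0.4, 20,      1,         20668),
    ("int ceil",                 0.0, 0.0118,  49,        0),
]


def is_excluded(percent, exclusive_msec, calls, usecs_per_call):
    """usec>1000 & numcalls>400000 & usecs/call<30 & percent>25"""
    usec = exclusive_msec * 1000.0
    return usec > 1000 and calls > 400000 and usecs_per_call < 30 and percent > 25


excluded = [name for name, *metrics in BEFORE if is_excluded(*metrics)]

# Inclusive time of main before and after reduction (msec, from the two listings).
reduction_factor = 4982 / 383
```

Under these assumptions the rule selects quicksort, sort_5elements, and interchange, which are exactly the routines missing from the "after" profile, and the measured run time of main drops by roughly a factor of 13 (4,982 msec to 383 msec).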
TAU Performance System Status
Computing platforms
IBM SP / Power4, SGI Origin 2K/3K, ASCI Red, Cray
T3E / SV-1 (X-1 planned), HP (Compaq) SC (Tru64), HP
Superdome (HP-UX), Sun, Hitachi SR8000, NEC SX-5
(SX-6 underway), Linux clusters (IA-32/64, Alpha, PPC,
PA-RISC, Power), Apple (OS X), Windows
Programming languages
C, C++, Fortran 77, F90, HPF, Java, OpenMP, Python
Communication libraries
MPI, PVM, Nexus, shmem, Tulip, ACLMPL, MPIJava
Thread libraries
pthreads, SGI sproc, Java, Windows, OpenMP, SMARTS
TAU Performance System Status (continued)
Compilers
Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun,
Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel
Application libraries (selected)
Blitz++, A++/P++, PETSc, SAMRAI, Overture, PAWS
Application frameworks (selected)
POOMA, MC++, Conejo, Uintah, VTF, UPS, GrACE
Performance projects using TAU
Aurora / SCALEA: ACPC, University of Vienna
TAU full distribution (Version 2.12, web download)
TAU performance system toolkit and user’s guide
Automatic software installation and examples
PDT Status
Program Database Toolkit (Version 2.2, web download)
EDG C++ front end (Version 2.45.2)
Mutek Fortran 90 front end (Version 2.4.1)
C++ and Fortran 90 IL Analyzer
DUCTAPE library
Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
TAU instrumentor (C/C++/F90)
Program analysis support for SILOON and CHASM
Platforms
Same as for TAU with a few exceptions
Performance Mapping
Associate performance measurements with high-level semantic abstractions
Performance mapping: performance measurement system support to assign data correctly
Semantic Entities/Attributes/Associations
New dynamic mapping scheme (SEAA)
Contrast with ParaMap (Miller and Irvin)
Entities defined at any level of abstraction
Attribute entity with semantic information
Entity-to-entity associations
Two association types (implemented in TAU API)
Embedded: extends associated object to store performance measurement entity
External: creates an external look-up table using address of object as key to locate performance measurement entity
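The two association types can be contrasted in a short sketch. None of these classes come from TAU: `PerfRecord` stands in for the performance measurement entity, and Python's `id()` stands in for the object address used as the look-up key.

```python
class PerfRecord:
    """Hypothetical performance measurement entity: accumulated time and count."""

    def __init__(self):
        self.time = 0.0
        self.count = 0

    def add(self, elapsed):
        self.time += elapsed
        self.count += 1


class EmbeddedParticle:
    """Embedded association: the object itself is extended with its record."""

    def __init__(self):
        self.perf = PerfRecord()


class PlainParticle:
    """An object whose definition cannot or should not be changed."""


class ExternalMap:
    """External association: look-up table keyed by object address (id() here)."""

    def __init__(self):
        self._table = {}

    def record_for(self, obj):
        return self._table.setdefault(id(obj), PerfRecord())


embedded = EmbeddedParticle()
embedded.perf.add(1.5)

plain = PlainParticle()
external = ExternalMap()
external.record_for(plain).add(2.0)
external.record_for(plain).add(3.0)
```

The embedded form avoids a table look-up on every measurement but requires changing the object's definition; the external form leaves the object untouched at the cost of a look-up, and in this sketch the key is only valid while the object is alive.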
Hypothetical Mapping Example
Particles distributed on surfaces of a cube
Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face=0, last=0; face < 6; face++){
    /* particles on this face */
    int particles_on_this_face = num(face);
    /* fill the next particles_on_this_face slots, starting at last */
    for (int i=last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face);
      ...
    }
    last+= particles_on_this_face;
  }
}
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles();
  /* create a list of particles */
  for (int i = 0; i < N; i++)
    /* iterates over the list */
    ProcessParticle(P[i]);
}
(Slide annotations: work packets, engine)
How much time is spent processing face i particles?
What is the distribution of performance among faces?
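These two questions cannot be answered from a routine-level profile, because all the time lands on the single processing routine. A sketch of the mapped view follows. It is illustrative only: `Particle`, `process_particle`, and the cost model (1 + face time units per particle) are made up, with the returned cost standing in for measured elapsed time.

```python
from collections import defaultdict


class Particle:
    def __init__(self, face):
        self.face = face


def process_particle(p):
    """Stand-in for real work; returns a hypothetical cost in time units."""
    return 1.0 + p.face


def profile_by_face(particles):
    """Charge each call's cost to the particle's face, not just to the routine."""
    routine_total = 0.0
    per_face = defaultdict(float)
    for p in particles:
        cost = process_particle(p)
        routine_total += cost          # what a routine-level profile reports
        per_face[p.face] += cost       # what the mapped profile reports
    return routine_total, dict(per_face)


particles = [Particle(face) for face in range(6) for _ in range(2)]
routine_total, per_face = profile_by_face(particles)
```

The routine-level total is a single number, while the per-face table shows how that total is distributed across the six faces.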
No Performance Mapping versus Mapping
Typical performance tools report performance with respect to routines
Does not provide support for mapping
Performance tools with SEAA mapping can observe performance with respect to scientist’s programming and problem abstractions
(Screenshots: TAU (no mapping), TAU (w/ mapping))
Performance Mapping in Callpath Profiling
Consider callgraph (callpath) profiling
Measure time (metric) along an edge (path) of callgraph
Incident edge gives parent / child view
Edge sequence (path) gives parent / descendant view
Callpath profiling when callgraph is unknown
Must determine callgraph dynamically at runtime
Map performance measurement to dynamic call path state
Callpath levels
0-level: current callgraph node
1-level: immediate parent (descendant)
k-level: kth calling parent (call descendant)
1-Level Callpath Implementation in TAU
TAU maintains a performance event (routine) callstack
Profiled routine (child) looks in callstack for parent
Previous profiled performance event is the parent
A callpath profile structure is created the first time the parent calls
TAU records parent in a callgraph map for child
String representing 1-level callpath used as its key
“a( )=>b( )” : name for time spent in “b” when called by “a”
Map returns pointer to callpath profile structure
1-level callpath is profiled using this profiling data
Build upon TAU’s performance mapping technology
Measurement is independent of instrumentation
Use -PROFILECALLPATH to configure TAU
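The mechanism described above can be sketched as a callstack plus a map keyed by the callpath string. This is a simplified model rather than TAU's code: `CallpathProfiler` and the scripted clock are hypothetical, and only call count and inclusive time are kept per callpath.

```python
class CallpathProfiler:
    """1-level callpath profiling: time in a child, split by immediate parent."""

    def __init__(self, clock):
        self.clock = clock
        self.callstack = []    # active routines: (name, parent name, start time)
        self.callpaths = {}    # "parent( )=>child( )" -> [calls, inclusive time]

    def enter(self, routine):
        # The previously entered, still active routine is the parent.
        parent = self.callstack[-1][0] if self.callstack else None
        self.callstack.append((routine, parent, self.clock()))

    def exit(self, routine):
        name, parent, started = self.callstack.pop()
        if name != routine:
            raise RuntimeError(f"exit from {routine!r} but {name!r} is active")
        elapsed = self.clock() - started
        if parent is not None:
            key = f"{parent}( )=>{name}( )"
            entry = self.callpaths.setdefault(key, [0, 0.0])  # created on first call
            entry[0] += 1
            entry[1] += elapsed


readings = iter([0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 9.0, 10.0])
prof = CallpathProfiler(clock=lambda: next(readings))

prof.enter("main")   # t = 0
prof.enter("a")      # t = 1
prof.enter("b")      # t = 2, called by a
prof.exit("b")       # t = 4
prof.exit("a")       # t = 5
prof.enter("b")      # t = 6, called by main
prof.exit("b")       # t = 9
prof.exit("main")    # t = 10
```

Time spent in `b` is now reported separately for the two callers, which a flat (0-level) profile would merge into one entry.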
Callpath Profiling Example (NAS LU v2.3)
% configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 \
    -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2
Callpath Parallel Profile Display
0-level and 1-level callpath grouping
(Screenshot panels: 0-Level Callpath, 1-Level Callpath)
Strategies for Empirical Performance Evaluation
Empirical performance evaluation as a series of
performance experiments
Experiment trials describing instrumentation and
measurement requirements
Where/When/How axes of empirical performance space
where are performance measurements made in program
when is performance instrumentation done
how are performance measurement/instrumentation chosen
Strategies for achieving flexibility and portability goals
Limited performance methods restrict evaluation scope
Non-portable methods force use of different techniques
Integration and combination of strategies
Case Study: SIMPLE Performance Analysis
SIMPLE hydrodynamics benchmark
C code with MPI message communication
Multiple instrumentation methods
source-to-source translation (PDT)
MPI wrapper library level instrumentation (PMPI)
pre-execution binary instrumentation (DyninstAPI)
Alternative measurement strategies
statistical profiles of software actions
statistical profiles of hardware actions (PCL, PAPI)
program event tracing
choice of time source
gettimeofday, high-res physical, CPU, process virtual
SIMPLE Source Instrumentation (Preprocessed)
PDT automatically generates instrumentation code
names events with full function signatures
int compute_heat_conduction(
    double theta_hat[X][Y], double deltat, double new_r[X][Y],
    double new_z[X][Y], double new_alpha[X][Y],
    double new_rho[X][Y], double theta_l[X][Y],
    double Gamma_k[X][Y], double Gamma_l[X][Y])
{
  /* event name is the full function signature; adjacent literals are concatenated */
  TAU_PROFILE("int compute_heat_conduction("
              "double (*)[259], double, double (*)[259], "
              "double (*)[259], double (*)[259], double (*)[259], "
              "double (*)[259], double (*)[259], double (*)[259])",
              " ", TAU_USER);
  ...
}
Similarly for all other routines in SIMPLE program
MPI Library Instrumentation (MPI_Send)
Uses MPI profiling interposition library (PMPI)
int MPI_Send(…)
...
{
  int returnVal, typesize;
  TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  if (dest != MPI_PROC_NULL) {
    PMPI_Type_size(datatype, &typesize);
    TAU_TRACE_SENDMSG(tag, dest, typesize*count);
  }
  returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
MPI Library Instrumentation (MPI_Recv)
int MPI_Recv(…)
...
{
  int returnVal, size;
  TAU_PROFILE_TIMER(tautimer, "MPI_Recv()", " ", TAU_MESSAGE);
  TAU_PROFILE_START(tautimer);
  returnVal = PMPI_Recv(buf, count, datatype, src, tag, comm,
                        status);
  if (src != MPI_PROC_NULL && returnVal == MPI_SUCCESS) {
    PMPI_Get_count( status, MPI_BYTE, &size );
    TAU_TRACE_RECVMSG(status->MPI_TAG, status->MPI_SOURCE,
                      size);
  }
  TAU_PROFILE_STOP(tautimer);
  return returnVal;
}
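The interposition pattern in both wrappers is the same: expose the original name, record what is of interest, and forward to the real routine. The sketch below shows that pattern in isolation. It is not MPI or TAU code: `interpose`, `real_send`, the size table, and the use of `None` in place of MPI_PROC_NULL are all stand-ins chosen for the example.

```python
import functools


def interpose(real_send, sent_log, datatype_sizes):
    """Return a wrapper with the same call signature as real_send."""

    @functools.wraps(real_send)
    def send(buf, count, datatype, dest, tag):
        if dest is not None:    # stand-in for the MPI_PROC_NULL check
            # Record (tag, destination, message size in bytes) before sending.
            sent_log.append((tag, dest, datatype_sizes[datatype] * count))
        return real_send(buf, count, datatype, dest, tag)   # forward unchanged

    return send


def real_send(buf, count, datatype, dest, tag):
    """Hypothetical underlying routine; always reports success (0)."""
    return 0


log = []
send = interpose(real_send, log, {"double": 8})
rc_normal = send([0.0] * 4, 4, "double", 1, 99)
rc_null = send([], 0, "double", None, 7)
```

The application keeps calling `send` with unchanged arguments and return value, so no application source has to be modified to collect the message data.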
Multi-Level Instrumentation (Profiling)
(Screenshot: profiles for four processes, with the event legend, a profile per process, and the global profile.)
Multi-Level Instrumentation (Tracing)
Relink with TAU library configured for tracing
No modification of source instrumentation required!
TAU performance groups
Dynamic Instrumentation of SIMPLE
Uses DyninstAPI for runtime code patching
Mutator loads measurement library, instruments mutatee
One mutator (tau_run) per executable image
mpirun -np <n> tau.shell
Case Study: PETSc v2.1.3 (ANL)
Portable, Extensible Toolkit for Scientific Computation
Scalable (parallel) PDE framework
Suite of data structures and routines (374,458 code lines)
Solution of scientific applications modeled by PDEs
Parallel implementation
MPI used for inter-process communication
TAU instrumentation
PDT for C/C++ source instrumentation (100%, no manual)
MPI wrapper interposition library instrumentation
Example
Linear system of equations (Ax=b) (SLES) (ex2 test case)
Non-linear system of equations (SNES) (ex19 test case)
PETSc ex2 (Profile - wallclock time)
Sorted with respect to exclusive time
PETSc ex2 (Profile - overall and message counts)
Observe load balance
Track messages
Capture with user-defined events
PETSc ex2 (Profile - percentages and time)
View per-thread performance on individual routines
PETSc ex2 (Trace)
PETSc ex19
Non-linear solver (SNES)
2-D driven cavity code
Uses velocity-vorticity formulation
Finite difference discretization on a structured grid
Problem size and measurements
56x56 mesh size on quad Pentium III (550 MHz, Linux)
Executes for approximately one minute
MPI wrapper interposition library
PDT (tau_instrumentor)
Selective instrumentation (tau_reduce)
three routines identified with high instrumentation overhead
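The idea behind selective instrumentation can be sketched as a filter over profile data: routines that are called very often but do very little work per call are candidates for exclusion. The rule shape and the thresholds below are illustrative assumptions, not the actual tau_reduce rule syntax, and all names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-routine profile summary.
struct RoutineProfile {
    std::string name;
    long calls;            // number of invocations
    double exclusiveUsec;  // total exclusive time in microseconds
};

// Returns the routines whose call count exceeds minCalls and whose
// exclusive time per call is below maxUsecPerCall. Both thresholds are
// caller-supplied assumptions.
std::vector<std::string> selectForExclusion(
        const std::vector<RoutineProfile>& profile,
        long minCalls, double maxUsecPerCall) {
    assert(minCalls >= 0 && maxUsecPerCall >= 0.0);
    std::vector<std::string> excluded;
    for (const RoutineProfile& r : profile) {
        if (r.calls == 0) continue;  // never called: nothing to measure
        double usecPerCall = r.exclusiveUsec / static_cast<double>(r.calls);
        if (r.calls > minCalls && usecPerCall < maxUsecPerCall) {
            excluded.push_back(r.name);
        }
    }
    return excluded;
}
```

The resulting list would then be handed to the instrumentor as an exclude list before re-instrumenting the application.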
PETSc ex19 (Profile - wallclock time)
Sorted by inclusive time
Sorted by exclusive time
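The two sort orders rest on the distinction between inclusive time (entry to exit, callees included) and exclusive time (inclusive time minus time spent in direct callees). A minimal sketch of how a timer stack can derive both, assuming explicit timestamps and hypothetical names; it is not TAU's implementation and does not treat recursion specially:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct RoutineTimes {
    double inclusive = 0.0;  // entry to exit, callees included
    double exclusive = 0.0;  // inclusive minus time in direct callees
};

// Illustrative timer stack driven by caller-supplied timestamps.
class TimerStack {
public:
    void enter(const std::string& name, double now) {
        stack_.push_back({name, now, 0.0});
    }
    void exit(double now) {
        assert(!stack_.empty());
        Frame f = stack_.back();
        stack_.pop_back();
        double incl = now - f.start;
        RoutineTimes& t = totals_[f.name];
        t.inclusive += incl;
        t.exclusive += incl - f.childTime;
        // Charge this routine's inclusive time to its caller's child time.
        if (!stack_.empty()) stack_.back().childTime += incl;
    }
    const std::map<std::string, RoutineTimes>& totals() const {
        return totals_;
    }

private:
    struct Frame {
        std::string name;
        double start;
        double childTime;
    };
    std::vector<Frame> stack_;
    std::map<std::string, RoutineTimes> totals_;
};
```

Sorting by inclusive time tends to surface top-level drivers, while sorting by exclusive time surfaces the routines where the time is actually spent.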
PETSc ex19 (Profile - overall and percentages)
PETSc ex19 (Tracing)
Commonly seen communication behavior
PETSc ex19 (Tracing - callgraph)
PETSc ex19 (PAPI_FP_INS, PAPI_L1_DCM)
Uses multiple-counter profile measurement (panels: PAPI_FP_INS, PAPI_L1_DCM)
Case Study: Mixed-mode Parallel Programs
Portable mixed-mode parallel programming
Multi-threaded shared memory programming
Inter-node message passing
Performance measurement
Access to runtime system and communication events
Associate communication and application events
2-Dimensional Stommel model of ocean circulation
OpenMP for shared memory parallel programming
MPI for cross-box message-based parallelism
Jacobi iteration, 5-point stencil
Timothy Kaiser (San Diego Supercomputer Center)
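For orientation, one serial Jacobi sweep with a 5-point stencil can be sketched as follows. The update has the same form as the instrumented loop shown on the Stommel instrumentation slide; the function name, the grid type, and the choice of a full-interior sweep are assumptions made for illustration, and no OpenMP or MPI decomposition is shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// One Jacobi sweep over the interior points of psi. Each new value depends
// only on the four neighbours from the previous iterate and a forcing term.
// Returns the accumulated absolute change, usable as a convergence measure.
double jacobiSweep(const Grid& psi, const Grid& force, Grid& newPsi,
                   double a1, double a2, double a3, double a4, double a5) {
    assert(psi.size() >= 3);
    assert(force.size() == psi.size() && newPsi.size() == psi.size());
    double diff = 0.0;
    for (std::size_t i = 1; i + 1 < psi.size(); ++i) {
        for (std::size_t j = 1; j + 1 < psi[i].size(); ++j) {
            newPsi[i][j] = a1 * psi[i + 1][j] + a2 * psi[i - 1][j]
                         + a3 * psi[i][j + 1] + a4 * psi[i][j - 1]
                         - a5 * force[i][j];
            diff += std::fabs(newPsi[i][j] - psi[i][j]);
        }
    }
    return diff;
}
```

Because every point reads only the previous iterate, the rows can be split across threads and the grid across processes, which is what makes the code a natural mixed-mode example.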
Stommel Instrumentation
OpenMP directive instrumentation (uses OPARI)
pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) \
        firstprivate (a1,a2,a3,a4,a5) nowait
for (i = i1; i <= i2; i++) {
  for (j = j1; j <= j2; j++) {
    new_psi[i][j] = a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                  + a4*psi[i][j-1] - a5*the_for[i][j];
    diff = diff + fabs(new_psi[i][j] - psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"
OpenMP + MPI Ocean Modeling (Trace)
Thread-paired message passing
Integrated OpenMP + MPI events
OpenMP + MPI Ocean Modeling (HW Profile)
% configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc \
  -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/lib
Integrated OpenMP + MPI events
FP instructions
Case Study: C++ and Performance Mapping
Object-oriented programming
Domain-specific abstractions
Implemented by OO languages in form of class libraries
Generic programming mechanisms
abstract data types, encapsulation, inheritance, …
efficient coding abstractions, compile-time transformations
Creates a semantic gap between the transformed code and
what the user expects (as described in source code)
Need a mechanism to expose the nature of high-level
abstract computation to the performance tools
Map low-level performance data to high-level semantics
C++ Template Instrumentation (Blitz++, PETE)
High-level objects
Optimizations
Array classes
Templates (Blitz++)
Array processing
Expressions (PETE)
Array expressions
Relate performance data to high-level statement
Complexity of template evaluation
Standard Template Instrumentation Difficulties
Instantiated templates result in mangled identifiers
Standard profiling techniques / tools are deficient
Integrated with proprietary compilers
Specific systems platforms and programming models
Very long!
Uninterpretable routine names
Blitz++ Library Instrumentation
Expression templates
embed the form of the expression in a template name
Expression: B + C - 2.0 * D
[Figure: expression tree for B + C - 2.0 * D]
BinOp<Add, B, BinOp<Subtract, C, BinOp<Multiply, Scalar<2.0>, D>>>
Blitz++ describes structure of the expression template
Present as pretty printed name to the profiling toolkit
Create performance event associated with expression type
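The mechanism described above can be sketched in a few lines: the expression's structure is carried in a nested template type, and a static function walks that type to build a readable name that can serve as the event name. This is an illustrative reduction, not Blitz++ code; every type and function name here is hypothetical, and the scalar is modelled as a named type because a floating-point template argument is not portable across C++ standards.

```cpp
#include <cassert>
#include <string>

// Operator tags.
struct Add      { static std::string symbol() { return "+"; } };
struct Subtract { static std::string symbol() { return "-"; } };
struct Multiply { static std::string symbol() { return "*"; } };

// Leaf operands (hypothetical stand-ins for arrays and a scalar constant).
struct ArrayB  { static std::string name() { return "B"; } };
struct ArrayC  { static std::string name() { return "C"; } };
struct ArrayD  { static std::string name() { return "D"; } };
struct Scalar2 { static std::string name() { return "2.0"; } };

// The expression's form is embedded in the template type itself.
template <class Op, class L, class R>
struct BinOp {
    static std::string name() {
        return "(" + L::name() + " " + Op::symbol() + " " + R::name() + ")";
    }
};

// Pretty-printed name, computed once per expression type and reused as the
// performance event name on every later evaluation.
template <class E>
const std::string& eventName() {
    static const std::string description = E::name();
    assert(!description.empty());
    return description;
}

using Expr =
    BinOp<Add, ArrayB, BinOp<Subtract, ArrayC, BinOp<Multiply, Scalar2, ArrayD>>>;
```

With these definitions, `eventName<Expr>()` yields `(B + (C - (2.0 * D)))` rather than a mangled template identifier.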
Blitz++ Library Instrumentation (example)
#ifdef BZ_TAU_PROFILING
  static string exprDescription;
  if (!exprDescription.length()) {
    exprDescription = "A";
    prettyPrintFormat format(_bz_true); // terse mode on
    format.nextArrayOperandSymbol();
    T_update::prettyPrint(exprDescription);
    expr.prettyPrint(exprDescription, format);
  }
  TAU_PROFILE(" ", exprDescription, TAU_BLITZ);
#endif
exprDescription is the event name
TAU Instrumentation and Profiling for C++
Profile of expression types
Performance data presented with respect to high-level array expression types
Case Study: C-SAFE / Uintah
Center for Simulation of Accidental Fires & Explosions
ASCI ASAP Level 1 center, University of Utah
PSE for multi-model simulation of high-energy explosions
Coupled non-linear solvers, optimization, computational
steering, visualization, and experimental data verification
Very large-scale simulations
Computer science problems:
Coupling of multiple simulation codes
Software engineering across diverse expert teams
Achieving high performance on large-scale systems
Example C-SAFE Simulation Problems
Heptane fire simulation
Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
Material stress simulation
Uintah Computational Framework (UCF)
Execution model based on software (macro) dataflow
Exposes parallelism and hides data transport latency
Computations expressed as directed acyclic graphs of tasks
each task consumes input and produces output (input to future task)
input/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory
DataWarehouse
Directory mapping names to values (array structured)
Write value once then communicate to awaiting tasks
Task graph gets mapped to processing resources
Communications schedule approximates global optimal
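The execution model above can be illustrated with a toy version: a write-once name-to-value store and a loop that runs any task whose inputs are available. This is a serial sketch under simplifying assumptions (scalar values, one output per task, no patches, no communication); the class and function names are hypothetical and do not reflect the UCF interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy single-assignment store: each name may be written exactly once.
class DataWarehouse {
public:
    void put(const std::string& name, double value) {
        bool inserted = values_.emplace(name, value).second;
        if (!inserted) throw std::logic_error("variable written twice: " + name);
    }
    bool has(const std::string& name) const { return values_.count(name) != 0; }
    double get(const std::string& name) const { return values_.at(name); }

private:
    std::map<std::string, double> values_;
};

// A task consumes named inputs and produces one named output.
struct Task {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
    std::function<double(const std::vector<double>&)> compute;
};

// Repeatedly runs every task whose inputs are present; returns the order in
// which tasks executed. Throws if some task can never become ready.
std::vector<std::string> runTasks(std::vector<Task> pending, DataWarehouse& dw) {
    std::vector<std::string> order;
    bool progress = true;
    while (!pending.empty() && progress) {
        progress = false;
        for (std::size_t i = 0; i < pending.size();) {
            const Task& t = pending[i];
            bool ready = true;
            for (const std::string& in : t.inputs) ready = ready && dw.has(in);
            if (!ready) { ++i; continue; }
            assert(static_cast<bool>(t.compute));
            std::vector<double> args;
            for (const std::string& in : t.inputs) args.push_back(dw.get(in));
            dw.put(t.output, t.compute(args));
            order.push_back(t.name);
            pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(i));
            progress = true;
        }
    }
    if (!pending.empty()) throw std::runtime_error("tasks with unsatisfied inputs remain");
    return order;
}
```

Because values are written once, a consumer never needs to ask which version of a variable it is reading; availability alone decides when it may run.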
Performance Technology Integration
Uintah presents challenges to performance integration
Software diversity and structure
UCF middleware, simulation code modules
component-based hierarchy
Portability objectives
cross-language and cross-platform
multi-parallelism: thread, message passing, mixed
Scalability objectives
High-level programming and execution abstractions
Requires flexible and robust performance technology
Requires support for performance mapping
Task Execution in Uintah Parallel Scheduler
Profile methods and functions in scheduler and in MPI library
Task execution time dominates (what task?)
Task execution time distribution
MPI communication overheads (where?)
Need to map performance data!
Uintah Task Performance Mapping
Uintah partitions individual particles across processing
elements (processes or threads)
Simulation tasks in task graph work on particles
Tasks have domain-specific character in the computation
“interpolate particles to grid” in Material Point Method
Task instances generated for each partitioned particle set
Execution scheduled with respect to task dependencies
How to attribute execution time among different tasks?
Assign semantic name (task type) to a task instance
SerialMPM::interpolateParticleToGrid
Map TAU timer object to (abstract) task (semantic entity)
Look up timer object using task type (semantic attribute)
Further partition along different domain-specific axes
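The mapping idea can be sketched independently of the TAU macros: keep one timer per semantic task type, look it up by the task's name, and charge each task instance's elapsed time to it. The registry class and its methods below are hypothetical and only stand in for the lookup step; they are not the TAU mapping API.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulated measurements for one semantic task type.
struct TaskStats {
    long calls = 0;
    double seconds = 0.0;
};

// Hypothetical registry that associates a timer with each task type name.
class TaskTimerRegistry {
public:
    // Look up (or create) the timer for a task type.
    TaskStats& lookup(const std::string& taskType) {
        assert(!taskType.empty());
        return timers_[taskType];
    }

    // Run one task instance and charge its time to its task type, rather
    // than to the generic scheduler routine that invoked it.
    template <class Fn>
    void run(const std::string& taskType, Fn&& doit) {
        TaskStats& stats = lookup(taskType);
        auto start = std::chrono::steady_clock::now();
        doit();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        ++stats.calls;
        stats.seconds += elapsed.count();
    }

    const std::map<std::string, TaskStats>& all() const { return timers_; }

private:
    std::map<std::string, TaskStats> timers_;
};
```

A scheduler loop would call something like `registry.run(task->getName(), ...)` so that the profile shows one entry per task type instead of a single entry for the routine that executes all tasks.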
Mapping Instrumentation in UCF (example)
Use TAU performance mapping API
void MPIScheduler::execute(const ProcessorGroup * pc,
                           DataWarehouseP & old_dw,
                           DataWarehouseP & dw ) {
  ...
  TAU_MAPPING_CREATE(
    task->getName(), "[MPIScheduler::execute()]",
    (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler,0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
Task Performance Mapping (Profile)
Mapped task performance across processes
Performance mapping for different tasks
Work Packet-to-Task Mapping (Trace)
Work packet computation events colored by task type
Distinct phases of computation can be identified based on task
Comparing Uintah Traces for Scalability Analysis
8 processes
32 processes
Online Performance Analysis for C-SAFE Apps
SCIRun (Univ. of Utah)
Application
Performance Steering
Performance Visualizer
// performance data streams
TAU Performance System
// performance data output
file system
accumulated samples
Performance Data Integrator
Performance Analyzer
Performance Data Reader
• sample sequencing
• reader synchronization
2D Field Performance Visualization in SCIRun
SCIRun program
Uintah Computational Framework (UCF)
UCF analysis
Scheduling
MPI library
Components
500 processes
Online and offline visualization
Performance steering
use SCIRun
support
Case Study: SAMRAI (LLNL)
Structured Adaptive Mesh Refinement Application Infrastructure (SAMRAI)
Programming
C++ and MPI
SPMD
Instrumentation
PDT for automatic instrumentation of routines
MPI interposition wrappers
SAMRAI timers for interesting code segments
timers classified in groups (apps, mesh, …)
timer groups are managed by TAU groups
SAMRAI (Profile)
Euler (2D)
return type routine name
SAMRAI Euler (Profile)
SAMRAI Euler (Trace)
Case Study: EVH1
Enhanced Virginia Hydrodynamics #1 (EVH1)
"TeraScale Simulations of Neutrino-Driven Supernovae
and Their Nucleosynthesis" SciDAC project
Configured to run a simulation of the Sedov-Taylor blast
wave solution in 2D spherical geometry
Performance study found EVH1 communication bound
for more than 64 processors
Predominant routine (>50% of execution time) at this
scale is MPI_ALLTOALL
Used in matrix transpose-like operations
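To see why an all-to-all exchange behaves like a transpose, the sketch below simulates the data movement serially for P ranks that each send one value to every rank. It models only the semantics of the exchange (what ends up where), not MPI itself; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffers = std::vector<std::vector<double>>;

// send[r][d] is the item rank r sends to rank d.
// The result recv[d][r] is the item rank d received from rank r,
// so the exchange transposes the P x P arrangement of items.
Buffers allToAll(const Buffers& send) {
    std::size_t p = send.size();
    Buffers recv(p, std::vector<double>(p, 0.0));
    for (std::size_t r = 0; r < p; ++r) {
        assert(send[r].size() == p);
        for (std::size_t d = 0; d < p; ++d) {
            recv[d][r] = send[r][d];
        }
    }
    return recv;
}

// Number of point-to-point transfers between distinct ranks.
std::size_t remoteTransfers(std::size_t p) {
    return p == 0 ? 0 : p * (p - 1);
}
```

The transfer count grows quadratically with the number of ranks, which is consistent with this routine coming to dominate as the processor count rises.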
EVH1 Execution Profile
EVH1 Execution Trace
MPI_Alltoall is an execution bottleneck
TAU Integration (Selected)
SAMRAI (LLNL)
Overture (LLNL)
C-SAFE (ASCI ASAP)
VTF (ASCI ASAP)
SAGE (ASCI LANL)
POOMA, POOMA-II (LANL, CodeSourcery)
PETSc (ANL)
CCA (DOE SciDAC)
GrACE (Rutgers)
Aurora / SCALEA (University of Vienna)
Work in Progress
Trace visualization
Event traces with counters (Vampir 3.0 will visualize)
EPILOG trace conversion
Runtime performance monitoring and analysis
Online performance data access
Performance analysis and visualization in SCIRun
Performance Database Framework
XML parallel profile representation of TAU profiles
PostgreSQL performance database
Next-generation PDT
Performance analysis for component software (CCA)
Concluding Remarks
Complex software and parallel computing systems pose
challenging performance analysis problems that require
robust methodologies and tools
To build more sophisticated performance tools, existing
proven performance technology must be utilized
Performance tools must be integrated with software and
systems models and technology
Performance engineered software
Function consistently and coherently in software and
system environments
TAU performance system offers robust performance
technology that can be broadly integrated … so USE IT!
Acknowledgements
Department of Energy (DOE)
MICS office
DOE 2000 ACTS contract
“Performance Technology for Tera-class Parallel Computer
Systems: Evolution of the TAU Performance System”
PERC SciDAC project affiliate
NSF National Young Investigator (NYI) award
Research Centre Juelich
University of Utah DOE ASCI Level 1 sub-contract
DOE ASCI Level 3 (LANL, LLNL)
John von Neumann Institute for Computing
Dr. Bernd Mohr
Los Alamos National Laboratory
Information
TAU (http://www.acl.lanl.gov/tau)
PDT (http://www.acl.lanl.gov/pdtoolkit)
PAPI (http://icl.cs.utk.edu/projects/papi/)
OPARI (http://www.fz-juelich.de/zam/kojak/)