Transcript Document 7334335
Performance Technology for Complex Parallel Systems
Sameer Shende University of Oregon
General Problems
How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
Computation Model for Performance Technology
How to address dual performance technology goals?
- Robust capabilities + widely available methodologies
  - contend with problems of system diversity
  - flexible tool composition/configuration/integration
- Approaches
  - restrict computation types / performance problems: limited performance technology coverage
  - base technology on an abstract computation model: general architecture and software execution features; map features/methods to existing complex system types; develop capabilities that can adapt and be optimized
General Complex System Computation Model
- Node: physically distinct shared memory machine
  - message-passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context

[Diagram: physical view of nodes on an interconnection network, and memory model view of node memory (SMP), VM-space contexts containing threads, and inter-node message communication]
Definitions – Profiling
Profiling
- Recording of summary information during execution
  - inclusive/exclusive time, # calls, hardware statistics, …
- Reflects performance behavior of program entities
  - functions, loops, basic blocks
  - user-defined "semantic" entities
- Very good for low-cost performance assessment
- Helps to expose performance bottlenecks and hotspots
- Implemented through
  - sampling: periodic OS interrupts or hardware counter traps
  - instrumentation: direct insertion of measurement code
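The instrumentation approach above can be sketched with a tiny profiler that inserts measurement code at region entry/exit and accumulates per-entity summaries (call count, time). This is a minimal illustration; the names (`profile_enter`, `profile_exit`, `ProfileEntry`) are invented for the sketch and are not part of TAU.

```cpp
#include <chrono>
#include <map>
#include <string>

// Per-entity summary statistics, as described above.
struct ProfileEntry {
    long calls = 0;
    double time_us = 0.0;  // accumulated wall-clock time
};

static std::map<std::string, ProfileEntry> profile_table;
static std::map<std::string, std::chrono::steady_clock::time_point> start_times;

// Direct insertion of measurement code at region entry...
void profile_enter(const std::string& name) {
    start_times[name] = std::chrono::steady_clock::now();
}

// ...and at region exit (note: this simple scheme does not handle recursion).
void profile_exit(const std::string& name) {
    auto stop = std::chrono::steady_clock::now();
    ProfileEntry& e = profile_table[name];
    e.calls++;
    e.time_us += std::chrono::duration<double, std::micro>(stop - start_times[name]).count();
}

int instrumented_work(int n) {
    profile_enter("work");
    int sum = 0;
    for (int i = 0; i < n; i++) sum += i;
    profile_exit("work");
    return sum;
}
```

After a run, `profile_table` holds exactly the kind of low-cost summary a profiler reports: one entry per instrumented entity with its call count and time.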
Definitions – Tracing
Tracing
- Recording of information about significant points (events) during program execution
  - entering/exiting code region (function, loop, block, …)
  - thread/process interactions (e.g., send/receive message)
- Save information in event record
  - timestamp
  - CPU identifier, thread identifier
  - event type and event-specific information
- Event trace is a time-sequenced stream of event records
- Can be used to reconstruct dynamic program behavior
- Typically requires code instrumentation
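The event-record layout described above can be sketched as a small struct plus a trace stream. Field names here are illustrative; real trace formats (ALOG, SDDF, Vampir) differ in detail.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One event record: timestamp, location, event type, event-specific data.
struct EventRecord {
    uint64_t timestamp;  // when the event occurred
    char     cpu;        // CPU/process identifier (e.g. 'A', 'B')
    std::string type;    // ENTER, EXIT, SEND, RECV, ...
    std::string info;    // event-specific data: region id, peer, ...
};

// An event trace is a time-sequenced stream of such records.
class Trace {
    std::vector<EventRecord> records_;
public:
    void record(uint64_t ts, char cpu, const std::string& type,
                const std::string& info) {
        records_.push_back({ts, cpu, type, info});
    }
    const std::vector<EventRecord>& records() const { return records_; }
};
```

Recording the events from the instrumentation example that follows (ENTER/SEND/EXIT on CPU A, ENTER/RECV/EXIT on CPU B) reproduces the monitor's merged, time-ordered stream.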
Event Tracing: Instrumentation , Monitor , Trace
CPU A:
  void master {
    trace(ENTER, 1);
    ...
    trace(SEND, B);
    send(B, tag, buf);
    ...
    trace(EXIT, 1);
  }

CPU B:
  void slave {
    trace(ENTER, 2);
    ...
    recv(A, tag, buf);
    trace(RECV, A);
    ...
    trace(EXIT, 2);
  }

Event definitions:
  1: master
  2: slave

MONITOR - merged event trace (timestamp, location, event, data):
  58  A  ENTER  1
  60  B  ENTER  2
  62  A  SEND   B
  64  A  EXIT   1
  68  B  RECV   A
  69  B  EXIT   2
Event Tracing: “Timeline” Visualization
[Timeline visualization: processes A (main, master) and B (slave) on the vertical axis, time 58-70 on the horizontal axis; the event records above (ENTER/ENTER/SEND/EXIT/RECV/EXIT) are rendered as colored intervals, with an arrow for the message from A's SEND at 62 to B's RECV at 68]
TAU Performance System Framework
Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling/tracing facility
- Open software approach
TAU Performance System Architecture
Levels of Code Transformation
- As program information flows through the stages of compilation/linking/execution, different information is accessible at different stages
- Each level poses different constraints and opportunities for extracting information
- At what level should performance instrumentation be done?
TAU Instrumentation
Flexible instrumentation mechanisms at multiple levels
- Source code
  - manual
  - automatic, using the Program Database Toolkit (PDT) or OPARI
- Object code
  - pre-instrumented libraries (e.g., MPI using PMPI), statically linked or dynamically linked (e.g., virtual machine instrumentation)
  - fast breakpoints (compiler generated)
- Executable code
  - dynamic instrumentation (pre-execution) using DynInstAPI
TAU Instrumentation (continued)
- Targets common measurement interface (TAU API)
- Object-based design and implementation
  - macro-based, using constructor/destructor techniques
  - program units: functions, classes, templates, blocks
- Uniquely identify functions and templates
  - name and type signature (name registration)
  - static object creates performance entry
  - dynamic object receives static object pointer
  - runtime type identification for template instantiations
- C and Fortran instrumentation variants
- Instrumentation and measurement optimization
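The static/dynamic object pairing described above can be sketched as follows: a function-static object is constructed once to create the performance entry, and a stack-allocated object on every entry receives a pointer to it, with its constructor/destructor marking the enter/exit events. Names here are illustrative; TAU's macros (e.g. `TAU_PROFILE`) generate equivalent boilerplate.

```cpp
#include <string>

// Static object: constructed once per instrumented function, holds the
// registered name/type signature and accumulated measurements.
struct FunctionInfo {
    std::string name;
    long calls = 0;
    explicit FunctionInfo(const std::string& n) : name(n) {}
};

// Dynamic object: constructed on every entry; receives the static object's
// pointer. Constructor = enter event, destructor = exit event.
class Profiler {
    FunctionInfo* fi_;
public:
    explicit Profiler(FunctionInfo* fi) : fi_(fi) { fi_->calls++; /* start timer */ }
    ~Profiler() { /* stop timer, accumulate into *fi_ */ }
};

static FunctionInfo fib_info("fib(int)");

long fib(int n) {
    Profiler prof(&fib_info);  // enter on construction, exit on destruction
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```

The destructor fires on every exit path, including exceptions, which is why constructor/destructor techniques are a natural fit for C++ instrumentation.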
Multi-Level Instrumentation
- Uses multiple instrumentation interfaces
- Shares information: cooperation between interfaces
- Taps information at multiple levels
- Provides selective instrumentation at each level
- Targets a common performance model
- Presents a unified view of execution
Program Database Toolkit (PDT)
Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - commercial-grade front end parsers
  - portable IL analyzer, database format, and access API
  - open software approach for tool development
- Target and integrate multiple source languages
- Used in TAU to build automated performance instrumentation tools
PDT Architecture and Tools
C/C++ Fortran 77/90
PDT Components
- Language front end
  - Edison Design Group (EDG): C, C++, Java
  - Mutek Solutions Ltd.: F77, F90
  - creates an intermediate-language (IL) tree
- IL Analyzer
  - processes the intermediate-language (IL) tree
  - creates "program database" (PDB) formatted file
- DUCTAPE (Bernd Mohr, ZAM, Germany)
  - C++ program Database Utilities and Conversion Tools APplication Environment
  - processes and merges PDB files
  - C++ library to access the PDB for PDT applications
TAU Measurement
Performance information
- High-resolution timer library (real-time / virtual clocks)
- General software counter library (user-defined events)
- Hardware performance counters
  - PCL (Performance Counter Library) (ZAM, Germany)
  - PAPI (Performance API) (UTK, Ptools Consortium)
  - consistent, portable API
Organization
- Node, context, thread levels
- Profile groups for collective events (runtime selective)
- Performance data mapping between software levels
TAU Measurement (continued)
Parallel profiling
- Function-level, block-level, statement-level
- Supports user-defined events
- TAU parallel profile database
- Function callstack
- Hardware counter values (in place of time)
Tracing
- All profile-level events
- Inter-process communication events
- Timestamp synchronization
User-configurable measurement library (user controlled)
TAU Measurement System Configuration
configure [OPTIONS] { c++ =
TAU Measurement Configuration – Examples
./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
  Use TAU with IBM's xlC compiler, PDT, and the pthread library; TAU profiling is enabled by default

./configure -TRACE -PROFILE
  Enable both TAU profiling and tracing

./configure -c++=guidec++ -cc=guidec -papi=/usr/local/packages/papi -openmp -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib
  Use OpenMP+MPI with KAI's Guide compiler suite, and use PAPI for accessing hardware performance counters for measurements

Typically configure multiple measurement libraries
TAU Measurement API
Initialization and runtime configuration
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(myNode);
  TAU_PROFILE_SET_CONTEXT(myContext);
  TAU_PROFILE_EXIT(message);
  TAU_REGISTER_THREAD();
Function and class methods
  TAU_PROFILE(name, type, group);
Template
  TAU_TYPE_STRING(variable, type);
  TAU_PROFILE(name, type, group);
  CT(variable);
User-defined timing
  TAU_PROFILE_TIMER(timer, name, type, group);
  TAU_PROFILE_START(timer);
  TAU_PROFILE_STOP(timer);
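A typical calling pattern for the API above looks like this. The macro definitions in the sketch are empty stand-ins so the example is self-contained; in a real build they come from TAU's headers and expand to the measurement machinery.

```cpp
// Stand-in definitions (illustration only; the real macros are TAU's).
#define TAU_PROFILE_INIT(argc, argv)            ((void)(argc), (void)(argv))
#define TAU_PROFILE_SET_NODE(node)              ((void)(node))
#define TAU_PROFILE_TIMER(t, name, type, group) int t = 0
#define TAU_PROFILE_START(t)                    ((void)(t))
#define TAU_PROFILE_STOP(t)                     ((void)(t))

// User-defined timing around a code region of interest.
int sum_of_squares(int n) {
    TAU_PROFILE_TIMER(timer, "sum_of_squares", "int (int)", 0);
    TAU_PROFILE_START(timer);
    int sum = 0;
    for (int i = 0; i < n; i++) sum += i * i;
    TAU_PROFILE_STOP(timer);
    return sum;
}

int run(int argc, char** argv) {
    TAU_PROFILE_INIT(argc, argv);   // once, at program start
    TAU_PROFILE_SET_NODE(0);        // identify this node
    return sum_of_squares(10);
}
```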
Compiling: TAU Makefiles
Include TAU Makefile in the user's Makefile. Variables:
  TAU_CXX          Specify the C++ compiler
  TAU_CC           Specify the C compiler used by TAU
  TAU_DEFS         Defines used by TAU. Add to CFLAGS
  TAU_LDFLAGS      Linker options. Add to LDFLAGS
  TAU_INCLUDE      Header files include path. Add to CFLAGS
  TAU_LIBS         Statically linked TAU library. Add to LIBS
  TAU_SHLIBS       Dynamically linked TAU library
  TAU_MPI_LIBS     TAU's MPI wrapper library for C/C++
  TAU_MPI_FLIBS    TAU's MPI wrapper library for F90
  TAU_FORTRANLIBS  Must be linked in with C++ linker for F90
Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs.
Including TAU Makefile - Example
include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc

CXX    = $(TAU_CXX)
CC     = $(TAU_CC)
CFLAGS = $(TAU_DEFS)
LIBS   = $(TAU_LIBS)
OBJS   = ...
TARGET = a.out

$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CC) $(CFLAGS) -c $< -o $@
TAU Makefile for PDT
include /usr/tau/include/Makefile

CXX      = $(TAU_CXX)
CC       = $(TAU_CC)
PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS   = $(TAU_DEFS)
LIBS     = $(TAU_LIBS)
OBJS     = ...
TARGET   = a.out

$(TARGET): $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(PDTPARSE) $<
	$(TAUINSTR) $*.pdb $< -o $*.inst.cpp
	$(CC) $(CFLAGS) -c $*.inst.cpp -o $@
Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01
% set path=($path
% tau_run a.out
% tau_run -XrunTAUsh-papi a.out
TAU Analysis
Profile analysis
- pprof: parallel profiler with text-based display
- racy: graphical interface to pprof (Tcl/Tk)
- jracy: Java implementation of racy
Trace analysis and visualization
- Trace merging and clock adjustment (if necessary)
- Trace format conversion (ALOG, SDDF, Vampir)
- Vampir (Pallas) trace visualization
Pprof Command
pprof [-c|-b|-m|-t|-e|-i|-v] [-r] [-s] [-n num] [-f file] [-l] [nodes]
  -c       Sort according to number of calls
  -b       Sort according to number of subroutines called
  -m       Sort according to msecs (exclusive time total)
  -t       Sort according to total msecs (inclusive time total)
  -e       Sort according to exclusive time per call
  -i       Sort according to inclusive time per call
  -v       Sort according to standard deviation (exclusive usec)
  -r       Reverse sorting order
  -s       Print only summary profile information
  -n num   Print only first num functions
  -f file  Specify full path and filename without node ids
  -l       List all functions and exit
Pprof Output (NAS Parallel Benchmark – LU)
- Intel Quad PIII Xeon, Red Hat Linux, PGI F90 + MPICH
- Profile for: node / context / thread
- Application events and MPI events
jRacy (NAS Parallel Benchmark – LU)
- Global profiles: routine profile across all nodes
  - n: node, c: context, t: thread
- Individual profile
Vampir Trace Visualization Tool
Visualization and Analysis of MPI Programs
- Originally developed by Forschungszentrum Jülich
- Current development by Technical University Dresden
- Distributed by PALLAS, Germany
http://www.pallas.de/pages/vampir.htm
Vampir (NAS Parallel Benchmark – LU)
- Callgraph display
- Timeline display
- Parallelism display
- Communications display
Case Study: Hybrid Computation (OpenMP + MPI)
Portable hybrid parallel programming
- OpenMP for shared memory parallel programming
  - fork-join model
  - loop-level parallelism
- MPI for cross-box message-based parallelism
OpenMP performance measurement
- Interface to OpenMP runtime system (RTS events)
- Compiler support and integration
2D Stommel model of ocean circulation
- Jacobi iteration, 5-point stencil
- Timothy Kaiser (San Diego Supercomputing Center)
OpenMP Instrumentation
OPARI [FZJ, Germany]
- OpenMP Pragma And Region Instrumentor (OPARI)
- Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
POMP
- OpenMP directive instrumentation
- OpenMP runtime library routine instrumentation
- Performance monitoring library control
- User code instrumentation
- Context descriptors
- Conditional compilation
- Conditional / selective transformations
Example: instrumentation of !$OMP PARALLEL DO

      call pomp_parallel_fork(d)
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
      call pomp_parallel_join(d)
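The transformation above can be transliterated to C-like code with stub POMP routines that simply log the event sequence; this shows the nesting OPARI produces around a parallel loop. In real use these calls are supplied by the measurement library, and `d` stands for the context descriptor that identifies the construct.

```cpp
#include <string>
#include <vector>

// Stub POMP routines that log the event order (illustration only).
static std::vector<std::string> pomp_log;
static void pomp(const std::string& event) { pomp_log.push_back(event); }

void instrumented_parallel_do(int d) {
    (void)d;                       // context descriptor in the real interface
    pomp("parallel_fork");
    // --- inside each thread of the parallel region ---
    pomp("parallel_begin");
    pomp("do_enter");
    for (int i = 0; i < 4; i++) { /* original loop body */ }
    // END DO NOWAIT: the loop itself does not synchronize; an explicit
    // barrier is inserted so its waiting cost is measured separately.
    pomp("barrier_enter");
    /* barrier */
    pomp("barrier_exit");
    pomp("do_exit");
    pomp("parallel_end");
    // --- end of parallel region ---
    pomp("parallel_join");
}
```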
Tracing Hybrid Executions – TAU and Vampir
Profiling Hybrid Executions
Case Study: Utah ASCI/ASAP Level 1 Center
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  - fundamental chemistry and engineering physics models
  - coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  - very large-scale simulations
- Computer science problems:
  - coupling of multiple simulation codes
  - software engineering across diverse expert teams
  - achieving high performance on large-scale systems
Example C-SAFE Simulation Problems
- Heptane fire simulation
- Material stress simulation
- Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
Uintah High-Level Component View
Uintah Parallel Component Architecture
Problem Specification
C-SAFE High Level Architecture
[Architecture diagram: PSE components (Simulation Controller, Scheduler, Data Manager, Database Controller, Mixing Model, Subgrid Model, Fluid Model, MPM, Numerical Solvers, Material Properties Database, Chemistry Databases, Post Processing and Analysis, High Energy Simulations) and non-PSE components (Resource Management, Parallel Services, Visualization, Performance Analysis, UCF, Checkpointing, Blazer); Resource Management is implicitly connected to all components; edges distinguish data from control/light data]
Uintah Computational Framework
Execution model based on software (macro) dataflow
- Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
  - each task consumes input and produces output (input to future tasks)
  - inputs/outputs specified for each patch in a structured grid
Abstraction of global single-assignment memory
- DataWarehouse
  - directory mapping names to values (array structured)
  - write value once, then communicate to awaiting tasks
Task graph gets mapped to processing resources
- Communication schedule approximates global optimal
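The single-assignment DataWarehouse idea can be sketched as a directory that maps names to values and rejects rewrites. This illustrates only the write-once discipline; the real UCF version is array-structured, per-patch, and communicates values to awaiting tasks.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Sketch: directory mapping names to values, each written exactly once.
class DataWarehouse {
    std::map<std::string, double> values_;
public:
    void put(const std::string& name, double v) {
        if (values_.count(name))
            throw std::runtime_error("single assignment violated: " + name);
        values_[name] = v;   // later: communicate to awaiting tasks
    }
    bool contains(const std::string& name) const {
        return values_.count(name) > 0;
    }
    double get(const std::string& name) const {
        return values_.at(name);  // a real task would wait until available
    }
};
```

Single assignment is what lets the scheduler treat the task graph as a DAG: once a value exists it never changes, so any task depending on it can run as soon as the value arrives.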
Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- Imminent computation is dataflow-constrained
- MPM: Newtonian material point motion time step
  - solid: values defined at material point (particle)
  - dashed: values defined at vertex (grid)
  - prime ('): values updated during time step
Uintah PSE
UCF automatically sets up:
- Domain decomposition
- Inter-processor communication with aggregation/reduction
- Parallel I/O
- Checkpoint and restart
- Performance measurement and analysis (stay tuned)
Software engineering
- Coding standards
- CVS (commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
- Correctness regression testing with bugzilla bug tracking
- Nightly build (parallel compiles)
- 170,000 lines of code (Fortran and C++ tasks supported)
Performance Technology Integration
Uintah presents challenges to performance integration
- Software diversity and structure
  - UCF middleware, simulation code modules
  - component-based hierarchy
- Portability objectives
  - cross-language and cross-platform
  - multi-parallelism: thread, message passing, mixed
- Scalability objectives
- High-level programming and execution abstractions
Requires flexible and robust performance technology
Requires support for performance mapping
Performance Analysis Objectives for Uintah
Micro tuning
- Optimization of simulation code (task) kernels for maximum serial performance
Scalability tuning
- Identification of parallel execution bottlenecks
  - overheads: scheduler, data warehouse, communication
  - load imbalance
- Adjustment of task graph decomposition and scheduling
Performance tracking
- Understand performance impacts of code modifications
- Throughout the course of software development
- C-SAFE application and UCF software
Uintah Performance Engineering Approach
- Contemporary performance methodology focuses on control flow (function) level measurement and analysis
- C-SAFE application involves coupled models with task-based parallelism and dataflow control constraints
- Performance engineering on an algorithmic (task) basis
  - observe performance based on algorithm (task) semantics
  - analyze task performance characteristics in relation to other simulation tasks and UCF components
  - scientific component developers can concentrate on performance improvement at the algorithmic level
  - UCF developers can concentrate on bottlenecks not directly associated with simulation module code
Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in scheduler and in MPI library
- Task execution time dominates (what task?)
- Task execution time distribution
- MPI communication overheads (where?)
- Need to map performance data!
Semantics-Based Performance Mapping
Associate performance measurements with high-level semantic abstractions Need mapping support in the performance measurement system to assign data correctly
Hypothetical Mapping Example
Particles distributed on surfaces of a cube
Particle* P[MAX];   /* array of particles */

int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* number of particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face); ...
    }
    last += particles_on_this_face;
  }
}
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle* p) {
  /* perform some computation on p */
}

int main() {
  GenerateParticles();             /* create a list of particles */
  for (int i = 0; i < N; i++)      /* iterate over the list */
    ProcessParticle(P[i]);
}
- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?
Semantic Entities/Attributes/Associations (SEAA)
New dynamic mapping scheme
- Entities defined at any level of abstraction
- Attribute entity with semantic information
- Entity-to-entity associations
Two association types (implemented in TAU API)
- Embedded: extends the data structure of the associated object to store the performance measurement entity
- External: creates an external look-up table, using the address of the object as the key to locate the performance measurement entity
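The external association type can be sketched as a look-up table keyed by object address, which leaves the application object's layout untouched (the embedded type would instead add a field to the object). Names here are illustrative, not TAU's actual internals; the `Face` entity echoes the earlier cube-face mapping example.

```cpp
#include <string>
#include <unordered_map>

// Performance measurement entity for one semantic entity.
struct PerfEntity {
    std::string name;    // semantic attribute, e.g. "face 3"
    long   count = 0;
    double time  = 0.0;
};

// External association: look-up table keyed by the object's address.
class ExternalAssociation {
    std::unordered_map<const void*, PerfEntity> table_;
public:
    PerfEntity& lookup(const void* object, const std::string& name) {
        auto it = table_.find(object);
        if (it == table_.end())
            it = table_.emplace(object, PerfEntity{name}).first;
        return it->second;
    }
};

struct Face { int id; };  // hypothetical semantic entity (cube face)

// Attribute a measurement to the face's performance entity.
void charge(ExternalAssociation& assoc, const Face& f, double elapsed) {
    PerfEntity& e = assoc.lookup(&f, "face " + std::to_string(f.id));
    e.count++;
    e.time += elapsed;
}
```

Charging each ProcessParticle call to its face's entity is exactly how the per-face time questions posed in the mapping example could be answered.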
No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines and do not provide support for mapping (TAU, no mapping)
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions (TAU, with mapping)
Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in the task graph work on particles
  - tasks have domain-specific character in the computation: "interpolate particles to grid" in the Material Point Method
- Task instances generated for each partitioned particle set
  - execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
  - assign a semantic name (task type) to a task instance: SerialMPM::interpolateParticleToGrid
  - map TAU timer object to (abstract) task (semantic entity)
  - look up timer object using task type (semantic attribute)
  - further partition along different domain-specific axes
Using External Associations
Two level mappings: Level 1:
...
Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP& old_dw, DataWarehouseP& dw)
{
  ...
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(),
                     task->getName(), 0);
  ...
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  // EXTERNAL ASSOCIATION
  ...
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
Task Performance Mapping (Profile)
- Mapped task performance across processes
- Performance mapping for different tasks
Task Performance Mapping (Trace)
- Work packet computation events colored by task type
- Distinct phases of computation can be identified based on task
Task Performance Mapping (Trace - Zoom)
Startup communication imbalance
Task Performance Mapping (Trace - Parallelism)
Communication / load imbalance
Comparing Uintah Traces for Scalability Analysis
[Trace comparison: 8 processes vs. 32 processes]
Scaling Performance Optimizations
ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory
- Last year: initial "correct" scheduler
- Reduced communication by 10x
- Reduced task graph overhead by 20x
Scalability to 2000 Processors (Fall 2001)
ASCI Nirvana SGI Origin 2000 Los Alamos National Laboratory
TAU Performance System Status
Computing platforms
- IBM SP, SGI Origin, Intel Teraflop, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), …
Programming languages
- C, C++, Fortran 77/90, HPF, Java
Communication libraries
- MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava
Thread libraries
- pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP
Compilers
- KAI, PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq
PDT Status
Program Database Toolkit (Version 2.1, web download)
- EDG C++ front end (Version 2.45.2)
- Mutek Fortran 90 front end (Version 2.4.1)
- C++ and Fortran 90 IL Analyzer
- DUCTAPE library
- Standard C++ system header files (KCC Version 4.0f)
PDT-constructed tools
- TAU instrumentor (C/C++/F90)
- Program analysis support for SILOON and CHASM
Platforms
- SGI, IBM, Compaq, Sun, HP, Linux (IA-32/IA-64), Apple, Windows, Cray T3E
Evolution of the TAU Performance System
- Customization of TAU for specific needs
- TAU's existing strength lies in its robust support for performance instrumentation and measurement
- TAU will evolve to support new performance capabilities
  - online performance data access via application-level API
  - dynamic performance measurement control
  - generalized performance mapping
  - runtime performance analysis and visualization
Information
TAU ( http://www.acl.lanl.gov/tau ) PDT ( http://www.acl.lanl.gov/pdtoolkit )
Support Acknowledgement
TAU and PDT support:
- Department of Energy (DOE)
  - DOE 2000 ACTS contract
  - DOE MICS contract
  - DOE ASCI Level 3 (LANL, LLNL)
- U. of Utah DOE ASCI Level 1 subcontract
- DARPA
- NSF National Young Investigator (NYI) award