Integrating Performance Analysis in Complex Scientific Software: Experiences with the Uintah Computational Framework


Integrating Performance Analysis in Complex Scientific Software:
Experiences with the Uintah Computational Framework

Allen D. Malony
[email protected]
Department of Computer and Information Science
Computational Science Institute
University of Oregon

Research Centre Juelich, April 9, 2002
Acknowledgements

- Sameer Shende, Robert Bell (University of Oregon)
- Steven Parker, J. Davison de St.-Germain, and Alan Morris (University of Utah)
- Department of Energy (DOE), ASCI Academic Strategic Alliances Program (ASAP)
  - Center for Simulation of Accidental Fires and Explosions (C-SAFE), ASCI/ASAP Level 1 center, University of Utah, http://www.csafe.utah.edu
  - Computational Science Institute, ASCI/ASAP Level 3 projects with LLNL / LANL, University of Oregon, http://www.csi.uoregon.edu
Complex Parallel Systems

- Complexity in computing system architecture
  - Diverse parallel system architectures: shared / distributed memory, cluster, hybrid, NOW, Grid, …
  - Sophisticated processor and memory architectures
  - Advanced network interface and switching architecture
  - Specialization of hardware components
- Complexity in parallel software environment
  - Diverse parallel programming paradigms: shared memory multi-threading, message passing, hybrid
  - Hierarchical, multi-level software architectures
  - Optimizing compilers and sophisticated runtime systems
  - Advanced numerical libraries and application frameworks
Complexity Drives Performance Need / Technology

- Observe/analyze/understand performance behavior
  - Multiple levels of software and hardware
  - Different types and detail of performance data
  - Alternative performance problem solving methods
  - Multiple targets of software and system application
- Robust AND ubiquitous performance technology
  - Broad scope of performance observability
  - Flexible and configurable mechanisms
  - Technology integration and extension
  - Cross-platform portability
  - Open, layered, and modular framework architecture
What is Parallel Performance Technology?

- Performance instrumentation tools
  - Different program code levels
  - Different system levels
- Performance measurement (observation) tools
  - Profiling and tracing of SW/HW performance events
  - Different software (SW) and hardware (HW) levels
- Performance analysis tools
  - Performance data analysis and presentation
  - Online and offline tools
- Performance experimentation and data management
- Performance modeling and prediction tools
Complexity Challenges for Performance Tools

- Computing system environment complexity
  - Observation integration and optimization
  - Access, accuracy, and granularity constraints
  - Diverse/specialized observation capabilities/technology
  - Restricted modes limit performance problem solving
- Sophisticated software development environments
  - Programming paradigms and performance models
  - Performance data mapping to software abstractions
  - Uniformity of performance abstraction across platforms
  - Rich observation capabilities and flexible configuration
  - Common performance problem solving methods
General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?

- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
Scientific Software Engineering

- Modern scientific simulation software is complex
  - Large development teams of diverse expertise
  - Simultaneous development on different system parts
  - Iterative, multi-stage, long-term software development
- Need support for managing complex software process
  - Software engineering tools for revision control, automated testing, and bug tracking are commonplace
  - Tools for HPC performance engineering are not
    - evaluation (measurement, analysis, benchmarking)
    - optimization (diagnosis, tracking, prediction, tuning)
- Incorporate performance engineering methodology and support by flexible and robust performance tools
Computation Model for Performance Technology

- How to address dual performance technology goals?
  - Robust capabilities + widely available methodologies
  - Contend with problems of system diversity
  - Flexible tool composition/configuration/integration
- Approaches
  - Restrict computation types / performance problems
    - limited performance technology coverage
  - Base technology on abstract computation model
    - general architecture and software execution features
    - map features/methods to existing complex system types
    - develop capabilities that can adapt and be optimized
General Complex System Computation Model

- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context

[Figure: physical view (SMP nodes, memory, interconnection network) mapped to the model view of nodes, contexts, and threads, with inter-node message communication]
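To make the model concrete, here is a minimal sketch (illustrative only, not TAU's internal representation) of indexing performance data by the (node, context, thread) location that the model defines:

#include <map>
#include <string>
#include <tuple>

// Sketch: identify where a measurement was taken using the abstract
// node / context / thread location defined by the computation model.
struct Location {
    int node;      // physically distinct shared-memory machine
    int context;   // distinct virtual address space within the node
    int thread;    // execution thread within the context
    bool operator<(const Location& o) const {
        return std::tie(node, context, thread) <
               std::tie(o.node, o.context, o.thread);
    }
};

// Per-location profile: event name -> accumulated inclusive time (seconds).
std::map<Location, std::map<std::string, double>> profile;

void record(const Location& where, const std::string& event, double seconds) {
    profile[where][event] += seconds;
}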
Framework for Performance Problem Solving

- Model-based performance technology
  - Instrumentation / measurement / execution models
    - performance observability constraints
    - performance data types and events
  - Analysis / presentation model
    - performance data processing
    - performance views and model mapping
  - Integration model
    - performance tool component configuration / integration
- Can a performance problem solving framework be designed based on a general complex system model and with a performance technology model approach?
TAU Performance System Framework



- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable performance profiling/tracing facility
  - Open software approach
TAU Performance System Architecture
[Figure: TAU performance system architecture, including trace export to Paraver and EPILOG formats]
Pprof Output (NAS Parallel Benchmark – LU)
- Intel quad PIII Xeon, RedHat, PGI F90
  - F90 + MPICH
- Profile for: node / context / thread
- Application events and MPI events
jRacy (NAS Parallel Benchmark – LU)
- n: node, c: context, t: thread
- Global profiles
- Routine profile across all nodes
- Individual profile
TAU + PAPI (NAS Parallel Benchmark – LU )



- Floating point operations replace execution time
- Only requires re-linking to a different TAU library
TAU + Vampir (NAS Parallel Benchmark – LU)
- Timeline display
- Callgraph display
- Parallelism display
- Communications display
Utah ASCI/ASAP Level 1 Center (C-SAFE)

- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  - Fundamental chemistry and engineering physics models
  - Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  - Very large-scale simulations
- Computer science problems:
  - Coupling of multiple simulation codes
  - Software engineering across diverse expert teams
  - Achieving high performance on large-scale systems
Example C-SAFE Simulation Problems
- Heptane fire simulation
- Material stress simulation
- Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
Uintah Problem Solving Environment

- Enhanced SCIRun PSE
  - Pure dataflow to component-based
  - Shared memory to scalable multi-/mixed-mode parallelism
  - Interactive only to interactive and standalone
- Design and implement Uintah component architecture
  - Application programmers provide (see the sketch below)
    - description of computation (tasks and variables)
    - code to perform the task on a single "patch" (sub-region of space)
  - Follow Common Component Architecture (CCA) model
- Design and implement Uintah Computational Framework (UCF) on top of the component architecture
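As a rough illustration of what the application programmer supplies, the sketch below uses hypothetical types and names (Task, Patch, DataWarehouse fields, and the example variable names are assumptions, not the actual Uintah interfaces): a named task declares the variables it requires and computes, plus a callback that performs the work on one patch.

#include <functional>
#include <string>
#include <vector>

struct Patch { /* sub-region of the structured grid */ };
struct DataWarehouse { /* single-assignment variable store */ };

// Hypothetical task description; the framework uses the declared inputs and
// outputs to build the task graph and schedule communication.
struct Task {
    std::string name;
    std::vector<std::string> requiredVars;   // inputs consumed
    std::vector<std::string> computedVars;   // outputs produced
    // work performed on a single patch (old and new data warehouses)
    std::function<void(const Patch&, DataWarehouse&, DataWarehouse&)> doit;
};

Task interpolate{
    "SerialMPM::interpolateParticleToGrid",
    {"p.mass", "p.velocity"},      // illustrative variable names (assumed)
    {"g.mass", "g.velocity"},
    [](const Patch& patch, DataWarehouse& old_dw, DataWarehouse& new_dw) {
        /* interpolate particle values to grid nodes on this patch */
    }
};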
Uintah High-Level Component View
Uintah Parallel Component Architecture
[Figure: Uintah parallel component architecture, built around the UCF data layer, with PSE components exchanging control / light data and non-PSE components implicitly connected to all components. Labeled components include: C-SAFE Problem Specification, Simulation Controller, Scheduler, MPM, Fluid Model, Mixing Model, Subgrid Model, Data Manager, Numerical Solvers, Material Properties Database, Chemistry Databases, Chemistry Database Controller, High Level Architecture, High Energy Simulations, Post Processing and Analysis, Parallel Services, Resource Management, Visualization, Database, Checkpointing, Performance Analysis, and Blazer.]
Uintah Computational Framework

- Execution model based on software (macro) dataflow
  - Exposes parallelism and hides data transport latency
  - Computations expressed as directed acyclic graphs of tasks
    - each task consumes input and produces output (input to a future task)
    - inputs/outputs specified for each patch in a structured grid
- Abstraction of global single-assignment memory: the DataWarehouse (sketched below)
  - Directory mapping names to values (array structured)
  - Write a value once, then communicate it to awaiting tasks
- Task graph gets mapped to processing resources
- Communication schedule approximates the global optimum
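A minimal sketch of the single-assignment idea behind the DataWarehouse (hypothetical names and a scalar value type for brevity; the real store holds array-structured, per-patch variables and drives the communication):

#include <map>
#include <stdexcept>
#include <string>

// Sketch of write-once semantics: each named value may be assigned exactly
// once, after which any number of awaiting tasks can read it.
class SingleAssignmentStore {
    std::map<std::string, double> values_;
public:
    void put(const std::string& name, double value) {
        if (values_.count(name))
            throw std::runtime_error("already assigned: " + name);
        values_[name] = value;    // write once ...
        // ... then (conceptually) forward the value to tasks awaiting it
    }
    double get(const std::string& name) const {
        return values_.at(name);  // read many times
    }
};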
Uintah Task Graph (Material Point Method)


- Diagram of named tasks (ovals) and data (edges)
- Imminent computation
  - Dataflow-constrained
- MPM
  - Newtonian material point motion time step
  - Solid: values defined at material point (particle)
  - Dashed: values defined at vertex (grid)
  - Prime ('): values updated during time step
Example Taskgraphs (MPM and Coupled)
Taskgraph Advantages




- Accommodates flexible integration needs
- Accommodates a wide range of unforeseen work loads
- Accommodates a mix of static and dynamic load balance
- Manages complexity of mixed-mode programming
  - Avoids unnecessary transport abstraction overheads
- Simulation time/space coupling
  - Allows uniform abstraction for coordinating coupled models' time and grid scales
- Allows application components and framework infrastructure (e.g., scheduler) to evolve independently
Uintah PSE

- UCF automatically sets up:
  - Domain decomposition
  - Inter-processor communication with aggregation/reduction
  - Parallel I/O
  - Checkpoint and restart
  - Performance measurement and analysis (stay tuned)
- Software engineering
  - Coding standards
  - CVS (commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)
  - Correctness regression testing with Bugzilla bug tracking
  - Nightly build (parallel compiles)
  - 170,000 lines of code (Fortran and C++ tasks supported)
Performance Technology Integration

- Uintah presents challenges to performance integration
  - Software diversity and structure
    - UCF middleware, simulation code modules
    - component-based hierarchy
  - Portability objectives
    - cross-language and cross-platform
    - multi-parallelism: thread, message passing, mixed
  - Scalability objectives
  - High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping
Performance Analysis Objectives for Uintah

- Micro tuning
  - Optimization of simulation code (task) kernels for maximum serial performance
- Scalability tuning
  - Identification of parallel execution bottlenecks
    - overheads: scheduler, data warehouse, communication
    - load imbalance
  - Adjustment of task graph decomposition and scheduling
- Performance tracking
  - Understand performance impacts of code modifications
  - Throughout the course of software development
    - C-SAFE application and UCF software
Uintah Performance Engineering Approach



- Contemporary performance methodology focuses on control flow (function) level measurement and analysis
- C-SAFE application involves coupled models with task-based parallelism and dataflow control constraints
- Performance engineering on an algorithmic (task) basis
  - Observe performance based on algorithm (task) semantics
  - Analyze task performance characteristics in relation to other simulation tasks and UCF components
    - scientific component developers can concentrate on performance improvement at the algorithmic level
    - UCF developers can concentrate on bottlenecks not directly associated with simulation module code
Task Execution in Uintah Parallel Scheduler

- Profile methods and functions in the scheduler and in the MPI library
- Task execution time dominates (what task?)
- Task execution time distribution
- MPI communication overheads (where?)
- Need to map performance data!
Semantics-Based Performance Mapping


- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly
Hypothetical Mapping Example

- Particles distributed on surfaces of a cube

Particle* P[MAX];  /* array of particles */
int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face = 0, last = 0; face < 6; face++) {
        /* particles on this face */
        int particles_on_this_face = num(face);
        for (int i = last; i < last + particles_on_this_face; i++) {
            /* particle properties are a function of face */
            P[i] = ... f(face);
            ...
        }
        last += particles_on_this_face;
    }
}
Hypothetical Mapping Example (continued)
int ProcessParticle(Particle* p) {
    /* perform some computation on p */
}

int main() {
    GenerateParticles();              /* create a list of particles */
    for (int i = 0; i < N; i++)       /* iterate over the list */
        ProcessParticle(P[i]);
}

- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?
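One way to answer these questions, sketched under the assumption that each particle's face index is known at processing time (plain C++ timing here; the TAU mapping API actually used by Uintah appears later):

#include <array>
#include <chrono>

// Attribute ProcessParticle() time to the face that generated the particle.
// Builds on the Particle array and ProcessParticle() from the example above.
std::array<double, 6> face_seconds{};   // one accumulator per cube face

void ProcessParticleTimed(Particle* p, int face) {
    auto t0 = std::chrono::steady_clock::now();
    ProcessParticle(p);
    auto t1 = std::chrono::steady_clock::now();
    face_seconds[face] += std::chrono::duration<double>(t1 - t0).count();
}

// In a parallel run each node/context/thread would keep its own accumulators
// and merge them afterwards; this bookkeeping is what the mapping support
// described next automates.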
Semantic Entities/Attributes/Associations (SEAA)

- New dynamic mapping scheme (S. Shende, Ph.D. thesis)
  - Contrast with ParaMap (Miller and Irvin)
  - Entities defined at any level of abstraction
  - Attribute entity with semantic information
  - Entity-to-entity associations
- Two association types (implemented in TAU API)
  - Embedded: extends the data structure of the associated object to store the performance measurement entity
  - External: creates an external look-up table using the address of the object as the key to locate the performance measurement entity
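A rough illustration of the two association types (a sketch with hypothetical types, not TAU's implementation):

#include <map>
#include <string>

struct Timer { std::string name; double seconds = 0; };  // stand-in for a timer

// Embedded association: the associated object's data structure is extended
// to hold a pointer to its performance measurement entity.
struct TaskEmbedded {
    std::string type;
    Timer* timer = nullptr;   // extra field stored in the object itself
};

// External association: an external look-up table keyed by the object's
// address, for objects whose data structures cannot be modified.
struct TaskExternal { std::string type; };
std::map<const void*, Timer*> externalTimers;

Timer* lookupTimer(const TaskExternal* task) {
    auto it = externalTimers.find(task);
    return (it == externalTimers.end()) ? nullptr : it->second;
}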

No Performance Mapping versus Mapping


- Typical performance tools report performance with respect to routines
  - Does not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions

[Figure: TAU profile without mapping vs. TAU profile with mapping]
Uintah Task Performance Mapping


- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in the task graph work on particles
  - Tasks have domain-specific character in the computation
    - "interpolate particles to grid" in the Material Point Method
- Task instances generated for each partitioned particle set
- Execution scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
  - Assign semantic name (task type) to a task instance
    - SerialMPM::interpolateParticleToGrid
  - Map TAU timer object to (abstract) task (semantic entity)
  - Look up timer object using task type (semantic attribute)
  - Further partition along different domain-specific axes
Task Performance Mapping Instrumentation
void MPIScheduler::execute(const ProcessorGroup* pc,
                           DataWarehouseP&       old_dw,
                           DataWarehouseP&       dw) {
  ...
  // create a semantic mapping entity keyed by the task's name
  TAU_MAPPING_CREATE(task->getName(), "[MPIScheduler::execute()]",
                     (TauGroup_t)(void*)task->getName(), task->getName(), 0);
  ...
  // link the timer object to this task type (external association)
  TAU_MAPPING_OBJECT(tautimer)
  TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName());
  ...
  // time the task execution and attribute it to the mapped task
  TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
  TAU_MAPPING_PROFILE_START(doitprofiler, 0);
  task->doit(pc);
  TAU_MAPPING_PROFILE_STOP(0);
  ...
}
Task Performance Mapping (Profile)
- Mapped task performance across processes
- Performance mapping for different tasks
Task Performance Mapping (Trace)
- Work packet computation events colored by task type
- Distinct phases of computation can be identified based on task
Task Performance Mapping (Trace - Zoom)
- Startup communication imbalance
Task Performance Mapping (Trace - Parallelism)
- Communication / load imbalance
Comparing Uintah Traces for Scalability Analysis
[Figure: trace timelines compared for 8-process and 32-process runs]
Scaling Performance Optimizations
- Last year: initial "correct" scheduler
- Reduce communication by 10x
- Reduce task graph overhead by 20x
- ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory
Scalability to 2000 Processors (Fall 2001)
- ASCI Nirvana, SGI Origin 2000, Los Alamos National Laboratory

[Figure: scaling results to 2000 processors]
Performance Tracking and Reporting


- Integrated performance measurement allows performance analysis throughout the development lifetime
- Applied performance engineering in the software design and development (software engineering) process
  - Create a "performance portfolio" from regular performance experimentation (coupled with software testing)
  - Use performance knowledge in making key software design decisions, prior to major development stages
  - Use performance benchmarking and regression testing to identify irregularities
  - Support automatic reporting of performance bugs
  - Cross-platform (cross-generation) evaluation
XPARE - eXPeriment Alerting and REporting

- Experiment launcher automates measurement / analysis
  - Configuration and compilation of performance tools
  - Uintah instrumentation control for experiment type
  - Multiple experiment execution
  - Performance data collection, analysis, and storage
  - Integrated in the Uintah software testing harness
- Reporting system conducts performance regression tests (see the sketch below)
  - Apply performance difference thresholds (alert ruleset)
  - Alerts users via email if thresholds have been exceeded
  - Web alerting setup and full performance data reporting
  - Historical performance data analysis
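A minimal sketch of the alert-ruleset idea (hypothetical names and threshold; not the actual XPARE code): compare the current experiment's per-event times against a stored baseline and flag changes that exceed a threshold.

#include <cmath>
#include <iostream>
#include <map>
#include <string>

// Event name -> inclusive time in seconds for one experiment.
using Profile = std::map<std::string, double>;

// Flag events whose relative change from the baseline exceeds the threshold
// (e.g., 0.10 = 10%); XPARE would email an alert and record the result.
bool checkRegression(const Profile& baseline, const Profile& current,
                     double threshold) {
    bool alert = false;
    for (const auto& [event, base] : baseline) {
        auto it = current.find(event);
        if (it == current.end() || base <= 0.0) continue;
        double change = (it->second - base) / base;
        if (std::fabs(change) > threshold) {
            std::cout << "ALERT: " << event << " changed by "
                      << change * 100.0 << "%\n";
            alert = true;
        }
    }
    return alert;
}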
XPARE System Architecture
[Figure: XPARE system architecture, connecting the Experiment Launch, Performance Database, Regression Analyzer, Comparison Tool, Alerting Setup, Performance Reporter, mail server, and web server]
Alerting Setup
Experiment Results Viewing Selection
Web-Based Experiment Reporting
Web-Based Experiment Reporting (continued)
Web-Based Experiment Reporting (continued)
Performance Analysis Tool Integration


- Complex systems pose challenging performance analysis problems that require robust methodologies and tools
- New performance problems will arise
  - No one performance tool can address all concerns
- Look towards an integration of performance technologies
  - Instrumentation and measurement
  - Data analysis and presentation
  - Diagnosis and tuning
- Support to link technologies to create performance problem solving environments
- Performance engineering methodology and tool integration with the software design and development process
Integrated Performance Evaluation Environment
References




- A. Malony and S. Shende, "Performance Technology for Complex Parallel and Distributed Systems," Proc. 3rd Workshop on Parallel and Distributed Systems (DAPSYS), pp. 37-46, Aug. 2000.
- S. Shende, A. Malony, and R. Ansell-Bell, "Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation," Proc. Int'l. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA), CSREA, pp. 1150-1156, July 2001.
- S. Shende, "The Role of Instrumentation and Mapping in Performance Measurement," Ph.D. Dissertation, Univ. of Oregon, Aug. 2001.
- J. de St. Germain, A. Morris, S. Parker, A. Malony, and S. Shende, "Integrating Performance Analysis in the Uintah Software Development Cycle," ISHPC 2002, Nara, Japan, May 2002.