Integrating Performance Analysis in Complex Scientific Software: Experiences with the Uintah Computational Framework
Allen D. Malony, [email protected]
Department of Computer and Information Science
Computational Science Institute
University of Oregon

April 9, 2002, Research Centre Juelich

Acknowledgements
- Sameer Shende, Robert Bell, University of Oregon
- Steven Parker, J. Davison de St. Germain, and Alan Morris, University of Utah
- Department of Energy (DOE), ASCI Academic Strategic Alliances Program (ASAP)
  - Center for Simulation of Accidental Fires and Explosions (C-SAFE), ASCI/ASAP Level 1 center, University of Utah, http://www.csafe.utah.edu
  - Computational Science Institute, ASCI/ASAP Level 3 projects with LLNL / LANL, University of Oregon, http://www.csi.uoregon.edu

Complex Parallel Systems
- Complexity in computing system architecture
  - Diverse parallel system architectures: shared / distributed memory, cluster, hybrid, NOW, Grid, ...
  - Sophisticated processor and memory architectures
  - Advanced network interface and switching architecture
  - Specialization of hardware components
- Complexity in the parallel software environment
  - Diverse parallel programming paradigms: shared memory multi-threading, message passing, hybrid
  - Hierarchical, multi-level software architectures
  - Optimizing compilers and sophisticated runtime systems
  - Advanced numerical libraries and application frameworks

Complexity Drives Performance Need / Technology
- Observe / analyze / understand performance behavior
  - Multiple levels of software and hardware
  - Different types and detail of performance data
  - Alternative performance problem solving methods
  - Multiple targets of software and system application
- Robust AND ubiquitous performance technology
  - Broad scope of performance observability
  - Flexible and configurable mechanisms
  - Technology integration and extension
  - Cross-platform portability
  - Open, layered, and modular framework architecture

What is Parallel Performance Technology?
- Performance instrumentation tools
- Performance measurement (observation) tools
  - Profiling and tracing of SW/HW performance events
  - Different software (SW) and hardware (HW) levels
- Performance analysis tools
  - Different program code levels
  - Different system levels
  - Performance data analysis and presentation
  - Online and offline tools
- Performance experimentation and data management
- Performance modeling and prediction tools

Complexity Challenges for Performance Tools
- Computing system environment complexity
  - Observation integration and optimization
  - Access, accuracy, and granularity constraints
  - Diverse / specialized observation capabilities and technology
  - Restricted modes limit performance problem solving
- Sophisticated software development environments
  - Programming paradigms and performance models
  - Performance data mapping to software abstractions
  - Uniformity of performance abstraction across platforms
  - Rich observation capabilities and flexible configuration
  - Common performance problem solving methods

General Problems
- How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
- How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?
Scientific Software Engineering
- Modern scientific simulation software is complex
  - Large development teams of diverse expertise
  - Simultaneous development on different system parts
  - Iterative, multi-stage, long-term software development
- Need support for managing a complex software process
  - Software engineering tools for revision control, automated testing, and bug tracking are commonplace
  - Tools for HPC performance engineering are not
    - evaluation (measurement, analysis, benchmarking)
    - optimization (diagnosis, tracking, prediction, tuning)
- Incorporate performance engineering methodology, supported by flexible and robust performance tools

Computation Model for Performance Technology
- How to address the dual performance technology goals?
  - Robust capabilities + widely available methodologies
  - Contend with problems of system diversity
  - Flexible tool composition / configuration / integration
- Approaches
  - Restrict computation types / performance problems
    - limited performance technology coverage
  - Base technology on an abstract computation model
    - general architecture and software execution features
    - map features/methods to existing complex system types
    - develop capabilities that can adapt and be optimized

General Complex System Computation Model
- Node: physically distinct shared memory machine
- Message passing node interconnection network
- Context: distinct virtual memory space within a node
- Thread: execution threads (user/system) within a context
[Figure: the physical view shows SMP nodes with memory joined by the interconnection network, with inter-node message communication; the model view refines each node into contexts (VM spaces), each containing threads.]
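To make the model concrete, here is a minimal illustrative sketch, in C++, of how a measurement system can index performance data by the node / context / thread triple defined above. It is not TAU code; the Location and profileData names are assumptions made for illustration.

    #include <map>
    #include <string>
    #include <tuple>

    // One measurement location in the model: node -> context -> thread.
    struct Location {
        int node;     // physically distinct shared-memory machine
        int context;  // distinct virtual memory space within the node
        int thread;   // execution thread (user or system) within the context
        bool operator<(const Location& o) const {
            return std::tie(node, context, thread) <
                   std::tie(o.node, o.context, o.thread);
        }
    };

    // Inclusive time in seconds, per instrumented event, per location.
    std::map<Location, std::map<std::string, double> > profileData;

    void recordTime(const Location& where, const std::string& event, double seconds) {
        profileData[where][event] += seconds;
    }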
Framework for Performance Problem Solving
- Model-based performance technology
  - Instrumentation / measurement / execution models
    - performance observability constraints
    - performance data types and events
  - Analysis / presentation model
    - performance data processing
    - performance views and model mapping
  - Integration model
    - performance tool component configuration / integration
- Can a performance problem solving framework be designed based on a general complex system model and with a performance technology model approach?

TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets the general complex system computation model
  - nodes / contexts / threads
- Multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable performance profiling/tracing facility
  - Open software approach

TAU Performance System Architecture
[Architecture diagram; labels include the Paraver and EPILOG trace formats.]

Pprof Output (NAS Parallel Benchmark – LU)
- Intel quad PIII Xeon, RedHat Linux, PGI F90 + MPICH
- Profile for: node / context / thread
- Application events and MPI events
[pprof text profile screenshot]

jRacy (NAS Parallel Benchmark – LU)
- n: node, c: context, t: thread
- Global profiles
- Routine profile across all nodes
- Individual profile
[jRacy profile browser screenshots]

TAU + PAPI (NAS Parallel Benchmark – LU)
- Floating point operation counts replace execution time as the profiled metric
- Only requires re-linking to a different TAU library
[Profile screenshot]

TAU + Vampir (NAS Parallel Benchmark – LU)
- Timeline display
- Callgraph display
- Parallelism display
- Communications display
[Vampir screenshots]

Utah ASCI/ASAP Level 1 Center (C-SAFE)
- C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions
  - Fundamental chemistry and engineering physics models
  - Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification
  - Very large-scale simulations
- Computer science problems:
  - Coupling of multiple simulation codes
  - Software engineering across diverse expert teams
  - Achieving high performance on large-scale systems

Example C-SAFE Simulation Problems
- Heptane fire simulation
- Material stress simulation
- Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics
[Simulation images]

Uintah Problem Solving Environment
- Enhanced SCIRun PSE
  - Pure dataflow to component-based
  - Shared memory to scalable multi-/mixed-mode parallelism
  - Interactive only to interactive and standalone
- Design and implement the Uintah component architecture
  - Application programmers provide
    - a description of the computation (tasks and variables)
    - code to perform a task on a single "patch" (sub-region of space)
  - Follows the Common Component Architecture (CCA) model
- Design and implement the Uintah Computational Framework (UCF) on top of the component architecture

Uintah High-Level Component View
[Component diagram]

Uintah Parallel Component Architecture
[High-level architecture diagram: C-SAFE problem specification, simulation controller, scheduler, data manager, MPM, fluid model, mixing model, subgrid model, high-energy simulations, numerical solvers, material properties database, chemistry databases and chemistry database controller, checkpointing, post-processing and analysis, visualization, parallel services, resource management, database, and performance analysis; non-PSE components are implicitly connected to all components; edges denote UCF data and control / light data connections.]

Uintah Computational Framework
- Execution model based on software (macro) dataflow
  - Exposes parallelism and hides data transport latency
- Computations expressed as directed acyclic graphs of tasks
  - Each task consumes inputs and produces outputs (inputs to future tasks)
  - Inputs/outputs are specified for each patch in a structured grid
- Abstraction of global single-assignment memory: the DataWarehouse
  - Directory mapping names to values (array structured)
  - Write a value once, then communicate it to awaiting tasks
- The task graph gets mapped to processing resources
  - Communications schedule approximates the global optimum
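As an illustration of this programming model, the following is a hedged sketch of the task/variable description an application programmer supplies and of what the framework does with it. The types and names (TaskDescription, DataWarehouse, addTask, string-valued variable names) are hypothetical stand-ins, not the actual UCF API.

    #include <functional>
    #include <string>
    #include <vector>

    struct Patch;          // a sub-region of the structured grid
    struct DataWarehouse;  // single-assignment store: values written once, then read

    // What an application programmer supplies for one task: the variables it
    // consumes and produces, plus the code that performs the task on one patch.
    struct TaskDescription {
        std::string name;                  // e.g. "SerialMPM::interpolateParticleToGrid"
        std::vector<std::string> inputs;   // variables required from the DataWarehouse
        std::vector<std::string> outputs;  // variables computed into the DataWarehouse
        std::function<void(const Patch*, DataWarehouse* old_dw,
                           DataWarehouse* new_dw)> run;   // per-patch kernel
    };

    // The scheduler collects task descriptions, derives a DAG from the
    // produce/consume relationships, creates one task instance per patch, and
    // computes an approximately optimal communication schedule.
    void addTask(std::vector<TaskDescription>& taskGraph, TaskDescription t) {
        taskGraph.push_back(std::move(t));
    }

The single-assignment discipline of the DataWarehouse is what allows the scheduler to derive the communication schedule from the produce/consume edges alone.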
Uintah Task Graph (Material Point Method)
- Diagram of named tasks (ovals) and data (edges)
- MPM: Newtonian material point motion time step
- Solid: values defined at material points (particles)
- Dashed: values defined at vertices (grid)
- Prime ('): values updated during the time step
[Task graph figure; annotations mark imminent computation and dataflow-constrained tasks.]

Example Taskgraphs (MPM and Coupled)
[Task graph figures]

Taskgraph Advantages
- Accommodates flexible integration needs
- Accommodates a wide range of unforeseen workloads
- Accommodates a mix of static and dynamic load balancing
- Manages the complexity of mixed-mode programming
- Simulation time/space coupling
- Avoids unnecessary transport abstraction overheads
- Allows a uniform abstraction for coordinating coupled models' time and grid scales
- Allows application components and framework infrastructure (e.g., the scheduler) to evolve independently

Uintah PSE
- UCF automatically sets up:
  - Domain decomposition
  - Inter-processor communication with aggregation/reduction
  - Parallel I/O
  - Checkpoint and restart
  - Performance measurement and analysis (stay tuned)
- Software engineering
  - Coding standards
  - CVS (commits: Year 3 - 26.6 files/day, Year 4 - 29.9 files/day)
  - Correctness regression testing with Bugzilla bug tracking
  - Nightly build (parallel compiles)
  - 170,000 lines of code (Fortran and C++ tasks supported)

Performance Technology Integration
- Uintah presents challenges to performance integration
  - Software diversity and structure
    - UCF middleware, simulation code modules
    - component-based hierarchy
  - Portability objectives
    - cross-language and cross-platform
    - multi-parallelism: thread, message passing, mixed
  - Scalability objectives
  - High-level programming and execution abstractions
- Requires flexible and robust performance technology
- Requires support for performance mapping

Performance Analysis Objectives for Uintah
- Micro tuning
  - Optimization of simulation code (task) kernels for maximum serial performance
- Scalability tuning
  - Identification of parallel execution bottlenecks
    - overheads: scheduler, data warehouse, communication
    - load imbalance
  - Adjustment of task graph decomposition and scheduling
- Performance tracking
  - Understand performance impacts of code modifications
  - Throughout the course of software development
  - C-SAFE application and UCF software

Uintah Performance Engineering Approach
- Contemporary performance methodology focuses on control-flow (function) level measurement and analysis
- The C-SAFE application involves coupled models with task-based parallelism and dataflow control constraints
- Performance engineering on an algorithmic (task) basis
  - Observe performance based on algorithm (task) semantics
  - Analyze task performance characteristics in relation to other simulation tasks and UCF components
  - Scientific component developers can concentrate on performance improvement at the algorithmic level
  - UCF developers can concentrate on bottlenecks not directly associated with simulation module code

Task Execution in Uintah Parallel Scheduler
- Profile methods and functions in the scheduler and in the MPI library (a sketch of this kind of instrumentation follows below)
- Task execution time dominates (what task?)
- Task execution time distribution
- MPI communication overheads (where?)
- Need to map performance data!
[Profile screenshots]
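A minimal sketch of the kind of routine-level TAU instrumentation that produces such profiles, using TAU's C++ macro API (macro names as documented for TAU of this era; the routine shown is a made-up example). MPI events are typically captured separately by linking against TAU's MPI wrapper library, with no source changes.

    #include <TAU.h>

    // Any routine of interest gets a TAU timer; it starts here and stops
    // automatically when the routine returns (scope exit).
    void executeTimestep()
    {
        TAU_PROFILE("executeTimestep()", "void ()", TAU_DEFAULT);
        // ... scheduler work: task execution, data warehouse access, MPI calls ...
    }

    int main(int argc, char** argv)
    {
        TAU_PROFILE_INIT(argc, argv);  // initialize the measurement system
        TAU_PROFILE_SET_NODE(0);       // identify this process (node) to TAU
        executeTimestep();
        return 0;
    }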
Semantics-Based Performance Mapping
- Associate performance measurements with high-level semantic abstractions
- Need mapping support in the performance measurement system to assign data correctly

Hypothetical Mapping Example
- Particles distributed on the surfaces of a cube

    Particle* P[MAX]; /* Array of particles */
    int GenerateParticles() {
      /* distribute particles over all faces of the cube */
      for (int face=0, last=0; face < 6; face++) {
        /* particles on this face */
        int particles_on_this_face = num(face);
        for (int i=last; i < last + particles_on_this_face; i++) {
          /* particle properties are a function of face */
          P[i] = ... f(face); ...
        }
        last += particles_on_this_face;
      }
    }

Hypothetical Mapping Example (continued)

    int ProcessParticle(Particle *p) {
      /* perform some computation on p */
    }
    int main() {
      GenerateParticles();           /* create a list of particles */
      for (int i = 0; i < N; i++)    /* iterate over the list */
        ProcessParticle(P[i]);
    }

- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?

Semantic Entities/Attributes/Associations (SEAA)
- New dynamic mapping scheme (S. Shende, Ph.D. thesis)
  - Contrast with ParaMap (Miller and Irvin)
- Entities defined at any level of abstraction
- Attribute entities with semantic information
- Entity-to-entity associations
- Two association types (implemented in the TAU API)
  - Embedded: extends the data structure of the associated object to store the performance measurement entity
  - External: creates an external look-up table, using the address of the object as the key to locate the performance measurement entity

No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines and do not provide support for mapping (TAU, no mapping)
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions (TAU, with mapping)
[Side-by-side profile screenshots]

Uintah Task Performance Mapping
- Uintah partitions individual particles across processing elements (processes or threads)
- Simulation tasks in the task graph work on particles
  - Tasks have a domain-specific character in the computation, e.g., "interpolate particles to grid" in the Material Point Method
- Task instances are generated for each partitioned particle set
  - Execution is scheduled with respect to task dependencies
- How to attribute execution time among different tasks?
  - Assign a semantic name (task type) to a task instance, e.g., SerialMPM::interpolateParticleToGrid
  - Map a TAU timer object to the (abstract) task (semantic entity)
  - Look up the timer object using the task type (semantic attribute)
  - Further partition along different domain-specific axes

Task Performance Mapping Instrumentation

    void MPIScheduler::execute(const ProcessorGroup * pc,
                               DataWarehouseP & old_dw,
                               DataWarehouseP & dw )
    {
      ...
      TAU_MAPPING_CREATE( task->getName(), "[MPIScheduler::execute()]",
                          (TauGroup_t)(void*)task->getName(), task->getName(), 0);
      ...
      TAU_MAPPING_OBJECT(tautimer)
      TAU_MAPPING_LINK(tautimer, (TauGroup_t)(void*)task->getName()); // EXTERNAL ASSOCIATION
      ...
      TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
      TAU_MAPPING_PROFILE_START(doitprofiler, 0);
      task->doit(pc);
      TAU_MAPPING_PROFILE_STOP(0);
      ...
    }
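As a concrete (hypothetical) illustration, the same mapping macros used in the scheduler excerpt above can answer the earlier example's question of how much time is spent on each face's particles: attribute each particle with its face index and use that index as the external-association key. The Particle::face field, the face_name strings, and the choice of the face index as the TauGroup_t key are assumptions made for this sketch, mirroring the pattern of the MPIScheduler::execute() code rather than prescribing it.

    /* Particle gains a face index: a semantic attribute used as the mapping key. */
    struct Particle { int face; /* ... other properties ... */ };

    static const char* face_name[6] = { "Face 0 particles", "Face 1 particles",
                                        "Face 2 particles", "Face 3 particles",
                                        "Face 4 particles", "Face 5 particles" };

    int GenerateParticles() {
      for (int face = 0, last = 0; face < 6; face++) {
        /* create one semantic entity per face, keyed by the face index */
        TAU_MAPPING_CREATE(face_name[face], "[GenerateParticles()]",
                           (TauGroup_t)face, face_name[face], 0);
        int particles_on_this_face = num(face);
        for (int i = last; i < last + particles_on_this_face; i++) {
          P[i] = new Particle();   /* properties are a function of face, as before */
          P[i]->face = face;       /* attribute each particle with its face */
        }
        last += particles_on_this_face;
      }
      return 0;
    }

    int ProcessParticle(Particle* p) {
      TAU_MAPPING_OBJECT(facetimer)
      TAU_MAPPING_LINK(facetimer, (TauGroup_t)p->face);  /* external look-up by face */
      TAU_MAPPING_PROFILE_TIMER(ptimer, facetimer, 0)
      TAU_MAPPING_PROFILE_START(ptimer, 0);
      /* ... perform some computation on p ... */
      TAU_MAPPING_PROFILE_STOP(0);
      return 0;
    }

With this mapping, the per-face timers answer the first two questions directly, and because TAU keeps data per node/context/thread, the per-face distribution is also available for parallel execution.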
Task Performance Mapping (Profile)
- Mapped task performance across processes
- Performance mapping for different tasks
[Profile screenshots]

Task Performance Mapping (Trace)
- Work packet computation events colored by task type
- Distinct phases of computation can be identified based on task
[Vampir timeline screenshot]

Task Performance Mapping (Trace - Zoom)
- Startup communication imbalance
[Trace screenshot]

Task Performance Mapping (Trace - Parallelism)
- Communication / load imbalance
[Parallelism display screenshot]

Comparing Uintah Traces for Scalability Analysis
- 8 processes versus 32 processes
[Trace comparison screenshots]

Scaling Performance Optimizations
- Last year: initial "correct" scheduler
- Reduced communication by 10x
- Reduced task graph overhead by 20x
- ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory
[Scaling plots]

Scalability to 2000 Processors (Fall 2001)
- ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory
[Scaling plot]

Performance Tracking and Reporting
- Integrated performance measurement allows performance analysis throughout the development lifetime
- Applied performance engineering in the software design and development (software engineering) process
- Create a "performance portfolio" from regular performance experimentation (coupled with software testing)
- Use performance knowledge in making key software design decisions, prior to major development stages
- Use performance benchmarking and regression testing to identify irregularities
- Support automatic reporting of performance bugs
- Cross-platform (cross-generation) evaluation

XPARE - eXPeriment Alerting and REporting
- Experiment launcher automates measurement / analysis
- Reporting system conducts performance regression tests
- Configuration and compilation of performance tools
- Uintah instrumentation control for experiment type
- Multiple experiment execution
- Performance data collection, analysis, and storage
- Integrated in the Uintah software testing harness
- Applies performance difference thresholds (alert ruleset); a sketch of such a check follows the reporting slides below
- Alerts users via email if thresholds have been exceeded
- Web alerting setup and full performance data reporting
- Historical performance data analysis

XPARE System Architecture
[Diagram: experiment launch, performance database, regression analyzer, comparison tool, alerting setup, mail server, web server, performance reporter]

Alerting Setup
[Screenshot]

Experiment Results Viewing Selection
[Screenshot]

Web-Based Experiment Reporting (three slides)
[Report web page screenshots]
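The following is a hypothetical sketch, not the XPARE implementation, of the kind of alert-ruleset check described above: compare a new experiment's per-event times against a stored baseline and flag any regression that exceeds a relative threshold. All names (Alert, checkThresholds, relativeThreshold) are invented for illustration.

    #include <map>
    #include <string>
    #include <vector>

    struct Alert {
        std::string event;   // e.g. "SerialMPM::interpolateParticleToGrid"
        double baseline;     // seconds, from the performance database
        double current;      // seconds, from the new experiment
    };

    std::vector<Alert> checkThresholds(const std::map<std::string, double>& baseline,
                                       const std::map<std::string, double>& current,
                                       double relativeThreshold /* e.g. 0.10 = 10% */) {
        std::vector<Alert> alerts;
        for (const auto& kv : current) {
            auto it = baseline.find(kv.first);
            if (it == baseline.end()) continue;             // no history for this event
            double slowdown = (kv.second - it->second) / it->second;
            if (slowdown > relativeThreshold)
                alerts.push_back({kv.first, it->second, kv.second});
        }
        return alerts;   // alerts would then be mailed and posted to the web reporter
    }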
Performance Analysis Tool Integration
- Complex systems pose challenging performance analysis problems that require robust methodologies and tools
- New performance problems will arise
- No one performance tool can address all concerns
- Look towards an integration of performance technologies
  - Instrumentation and measurement
  - Data analysis and presentation
  - Diagnosis and tuning
- Support to link technologies to create performance problem solving environments
- Performance engineering methodology and tool integration with the software design and development process

Integrated Performance Evaluation Environment
[Diagram]

References
- A. Malony and S. Shende, "Performance Technology for Complex Parallel and Distributed Systems," Proc. 3rd Workshop on Parallel and Distributed Systems (DAPSYS), pp. 37-46, Aug. 2000.
- S. Shende, A. Malony, and R. Ansell-Bell, "Instrumentation and Measurement Strategies for Flexible and Portable Empirical Performance Evaluation," Proc. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA), CSREA, pp. 1150-1156, July 2001.
- S. Shende, "The Role of Instrumentation and Mapping in Performance Measurement," Ph.D. Dissertation, University of Oregon, Aug. 2001.
- J. de St. Germain, A. Morris, S. Parker, A. Malony, and S. Shende, "Integrating Performance Analysis in the Uintah Software Development Cycle," ISHPC 2002, Nara, Japan, May 2002.