The TAU Performance System


Performance Technology for Productive, High-End Parallel Computing
Allen D. Malony
[email protected]
Department of Computer and Information Science
Performance Research Laboratory
University of Oregon
Outline of Talk

• Performance problem solving
  – Scalability, productivity, and performance technology
  – Application-specific and autonomic performance tools
• TAU parallel performance system and advances
• Performance data management and data mining
  – Performance Data Management Framework (PerfDMF)
  – PerfExplorer
• Multi-experiment case studies
  – Comparative analysis (PERC tool study)
  – Clustering analysis
• Future work and concluding remarks
Research Motivation

• Tools for performance problem solving
  – Empirical-based performance optimization process
  – Performance technology concerns
[Diagram: the empirical optimization cycle — Performance Observation → characterization → Performance Experimentation → properties → Performance Diagnosis → hypotheses → Performance Tuning — supported by performance technology for instrumentation, measurement, analysis, and visualization, plus experiment management and performance storage]
Challenges in Performance Problem Solving



• How to make the process more effective (productive)?
• Process may depend on scale of parallel system
• What are the important events and performance metrics?
  – Tied to application structure and computational model
  – Tied to application domain and algorithms
• Process and tools can/must be more application-aware
  – Tools have poor support for application-specific aspects
• What are the significant issues that will affect the technology used to support the process?
• Enhance application development and benchmarking
  – New paradigm in performance process and technology
Large Scale Performance Problem Solving




• How does our view of this process change when we consider very large-scale parallel systems?
• What are the significant issues that will affect the technology used to support the process?
• Parallel performance observation is clearly needed
• In general, there is the concern for intrusion
  – Seen as a tradeoff with performance diagnosis accuracy
• Scaling complicates observation and analysis
  – Performance data size becomes a concern
  – Analysis complexity increases
• Nature of application development may change
Role of Intelligence, Automation, and Knowledge




• Scale forces the process to become more intelligent
• Even with intelligent and application-specific tools, deciding what to analyze is difficult and can be intractable
• More automation and knowledge-based decision making
• Build autonomic capabilities into the tools
  – Support broader experimentation methods and refinement
  – Access and correlate data from several sources
  – Automate performance data analysis / mining / learning
  – Include predictive features and experiment refinement
  – Knowledge-driven adaptation and optimization guidance
• Address scale issues through increased expertise
TAU Performance System


• Tuning and Analysis Utilities (13+ year project effort)
• Performance system framework for HPC systems
  – Integrated, scalable, flexible, and parallel
• Targets a general complex system computation model
  – Entities: nodes / contexts / threads
  – Multi-level: system / software / parallelism
  – Measurement and analysis abstraction
• Integrated toolkit for performance problem solving
  – Instrumentation, measurement, analysis, and visualization
  – Portable performance profiling and tracing facility
  – Performance data management and data mining
• University of Oregon, Research Centre Jülich, LANL
TAU Parallel Performance System Goals

• Multi-level performance instrumentation
  – Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  – Computer system architectures and operating systems
  – Different programming languages and compilers
• Support for multiple parallel programming paradigms
  – Multi-threading, message passing, mixed-mode, hybrid
• Support for performance mapping
• Support for object-oriented and generic programming
• Integration in complex software, systems, applications
TAU Performance System Architecture

[Architecture diagrams (shown on two slides)]
Advances in TAU Instrumentation

• Source instrumentation
  – Program Database Toolkit (PDT)
    ◦ automated Fortran 90/95 support (Flint parser, very robust)
    ◦ statement-level support in C/C++ (Fortran soon)
  – TAU_COMPILER to automate instrumentation process
  – Automatic proxy generation for component applications
    ◦ automatic CCA component instrumentation
  – Python instrumentation and automatic instrumentation (see the sketch below)
• Continued integration with dynamic instrumentation
• Update of OpenMP instrumentation (POMP2)
• Selective instrumentation and overhead reduction
• Improvements in performance mapping instrumentation
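
As a hedged illustration of the Python instrumentation mentioned above, here is a minimal manual-instrumentation sketch using TAU's pytau module; the calls follow TAU's documented Python API, but verify the names against your TAU installation.

    # Minimal sketch: manual Python instrumentation with TAU's pytau
    # module. Run under a TAU-enabled Python so profile files are
    # written at exit for viewing in ParaProf/pprof.
    import pytau

    def compute(n):
        timer = pytau.profileTimer("compute")   # create/look up a named timer
        pytau.start(timer)                      # enter the timed region
        total = sum(i * i for i in range(n))
        pytau.stop(timer)                       # exit; time is charged to "compute"
        return total

    compute(1_000_000)

For whole-program automatic instrumentation, TAU also ships a wrapper module used as import tau; tau.run('compute(1000000)'), which instruments at the interpreter level without source changes.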
Advances in TAU Measurement

• Profiling
  – Memory profiling
    ◦ global heap memory tracking (several options)
  – Callpath profiling
    ◦ user-controllable calling depth
  – Phase-based profiling
  – Online profile access
• Tracing
  – Generation of VTF3 trace files (fully portable)
  – Inclusion of hardware performance counts in trace files
  – Hierarchical trace merging
• Online performance overhead compensation
• Component software proxy generation and monitoring
Profile Measurement – Three Flavors

• Flat profiles
  – Time (or counts) spent in each routine (nodes in callgraph)
  – Exclusive/inclusive time, # of calls, child calls
• Callpath profiles
  – Time spent along a calling path (edges in callgraph)
  – “main => f1 => f2 => MPI_Send”
  – Depth set by the TAU_CALLPATH_DEPTH environment variable
• Phase-based profiles
  – Flat profiles under a phase (nested phases are allowed)
  – Default “main” phase
  – Supports static or dynamic (per-iteration) phases (see the sketch after this list)
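
A hedged sketch of per-iteration dynamic phases driven from Python: TAU's native route is its C phase API, but the same effect can be approximated by giving each timestep its own pytau timer (names below are illustrative).

    # Sketch: emulating dynamic (per-iteration) phases from Python by
    # creating one pytau timer per timestep, so every iteration gets a
    # separate entry in the profile.
    import pytau

    def advance(step):
        return sum(i for i in range(10_000))    # stand-in for real work

    def run_timesteps(steps):
        for step in range(steps):
            phase = pytau.profileTimer("Iteration %d" % step)
            pytau.start(phase)
            advance(step)
            pytau.stop(phase)

    run_timesteps(92)

This per-iteration attribution is what enables observations like the MPI_Waitall breakdown on the dynamic-phases slide below.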
Advances in TAU Performance Analysis

• Enhanced parallel profile analysis (ParaProf)
  – Callpath analysis integration in ParaProf
  – Event callgraph view
• Performance Data Management Framework (PerfDMF)
  – First release of prototype
  – In use by several groups
    ◦ S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), …
• Integration with Vampir Next Generation (VNG)
  – Online trace analysis
• Performance visualization (ParaVis) prototype
• Component performance modeling and QoS
Flat Profile – Pprof (NPB LU)




• Intel Linux cluster, F90 + MPICH
• Profile shown per node / context / thread
• Events: code and MPI
[pprof text output]
Flat Profile – ParaProf (Miranda)
Callpath Profile (Flash)
Callpath Profile
• 21-level callpath
Phase Profile – Dynamic Phases
• In the 51st iteration, time spent in MPI_Waitall was 85.81 secs
• Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
ParaProf – Manager
• Performance database
• Derived performance metrics
[ParaProf manager window]
ParaProf – Histogram View (Miranda)
[Histograms for 8K processors and 16K processors]
ParaProf – Stacked View (Miranda)
ParaProf – Full Callgraph View (MFIX)
ParaProf – Callpath Highlighting (Flash)
ParaProf – Callgraph Zoom (Flash)
Profiling of Miranda on BG/L (Miller, LLNL)


• Profile code performance (automatic instrumentation)
• Scaling studies (problem size, number of processors)
[Profile views for 128, 512, and 1024 nodes]
• Run on 8K and 16K processors!
Fine Grained Profiling via Tracing on Miranda

• Use TAU to generate VTF3 traces for Vampir analysis
  – Combines MPI calls with HW counter information
  – Detailed code behavior to focus optimization efforts
Memory Usage Analysis


• BG/L will have limited memory per node (512 MB)
• Miranda uses TAU to profile memory usage
  – Streamlines code
  – Squeezes larger problems onto the machine
• TAU’s footprint is small
  – Approximately 100 bytes per event per thread (see the estimate below)
[Chart: Max Heap Memory (KB) used for the 128³ problem on 16 processors of ASC Frost at LLNL]
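
A quick back-of-the-envelope check of that footprint figure as a Python sketch (the event and thread counts are illustrative assumptions, not measured values):

    # Estimate TAU's profiling footprint from the ~100 bytes per event
    # per thread figure quoted above. Counts are invented examples.
    BYTES_PER_EVENT_PER_THREAD = 100

    def tau_footprint_bytes(num_events, threads_per_node):
        return BYTES_PER_EVENT_PER_THREAD * num_events * threads_per_node

    # e.g., 500 instrumented events, 4 threads on a node:
    print(tau_footprint_bytes(500, 4))   # 200000 bytes ~ 0.2 MB

Even a generous event count stays far below a 512 MB BG/L node's memory.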
TAU Performance System Status

• Computing platforms (selected)
  – IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
• Programming languages
  – C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
• Thread libraries
  – pthreads, SGI sproc, Java, Windows, OpenMP
• Compilers (selected)
  – Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft
Important Questions for Application Developers










• How does performance vary with different compilers?
• Is poor performance correlated with certain OS features?
• Has a recent change caused unanticipated performance?
• How does performance vary with MPI variants?
• Why is one application version faster than another?
• What is the reason for the observed scaling behavior?
• Did two runs exhibit similar performance?
• How are performance data related to application events?
• Which machines will run my code the fastest and why?
• Which benchmarks predict my code performance best?
Performance Problem Solving Goals

• Answer questions at multiple levels of interest
  – Data from low-level measurements and simulations
    ◦ use to predict application performance
  – High-level performance data spanning dimensions
    ◦ machine, applications, code revisions, data sets
    ◦ examine broad performance trends
• Discover general correlations between application performance and features of its external environment
• Develop methods to predict application performance from lower-level metrics
• Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system (see the sketch below)
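
To make that last goal concrete, a minimal correlation sketch in Python (all timing values are invented placeholders, not measured data):

    # Correlate a benchmark's runtimes with an application's runtimes
    # across several machines; a correlation near 1.0 suggests the
    # benchmark is a good predictor for this workload.
    import numpy as np

    bench_time = np.array([10.0, 14.0, 9.0, 20.0])      # benchmark, per machine
    app_time   = np.array([105.0, 150.0, 98.0, 210.0])  # application, per machine

    r = np.corrcoef(bench_time, app_time)[0, 1]
    print(f"correlation r = {r:.3f}")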
Automatic Performance Analysis Tool (Concept)
[Concept diagram: PerfTrack performance database]
• PSU: Kathryn Mohror, Karen Karavanic
• UO: Kevin Huck
• LLNL: John May, Brian Miller (CASC)
Performance Data Management Framework
TAU Performance Regression (PerfRegress)


• Prototype developed by Alan Morris for Uintah
• Re-implement using PerfDMF
ParaProf Performance Profile Analysis
[Diagram: raw profile files from TAU, HPMToolkit, and MpiP, plus metadata, are loaded into the PerfDMF-managed database, organized as Application → Experiment → Trial]
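
A minimal sketch of that Application → Experiment → Trial organization as Python dataclasses (field names are illustrative assumptions, not PerfDMF's actual schema):

    # Sketch of PerfDMF's data organization: applications own experiments,
    # experiments own trials, and each trial carries profile data + metadata.
    from dataclasses import dataclass, field

    @dataclass
    class Trial:
        name: str
        profile_source: str                     # e.g., "TAU", "HPMToolkit", "MpiP"
        metadata: dict = field(default_factory=dict)

    @dataclass
    class Experiment:
        name: str
        trials: list = field(default_factory=list)

    @dataclass
    class Application:
        name: str
        experiments: list = field(default_factory=list)

    gyro = Application("GYRO", [Experiment("B1-std", [Trial("16p", "TAU")])])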
PerfExplorer (K. Huck, UO)

• Performance knowledge discovery framework
  – Use the existing TAU infrastructure
    ◦ TAU instrumentation data, PerfDMF
  – Client-server based system architecture
  – Data mining analysis applied to parallel performance data
• Technology integration
  – Relational Database Management Systems (RDBMS)
  – Java API and toolkit
  – R-project / Omegahat statistical analysis
  – Web-based client
    ◦ Jakarta web server and Struts (for a thin web-client)
PerfExplorer Architecture
• Server accepts multiple client requests and returns results
• Server supports R data mining operations built using RSJava
• PerfDMF Java API used to access DBMS via JDBC
• Client is a traditional Java application with GUI (Swing)
• Analyses can be scripted, parameterized, and monitored
• Browsing of analysis results via automatic web page creation and thumbnails
PERC Tool Requirements and Evaluation

• Performance Evaluation Research Center (PERC)
  – DOE SciDAC
  – Evaluation methods/tools for high-end parallel systems
• PERC tools study (led by ORNL, Pat Worley)
  – In-depth performance analysis of select applications
  – Evaluate performance analysis requirements
  – Test tool functionality and ease of use
• Applications
  – Start with fusion code – GYRO
  – Repeat with other PERC benchmarks
  – Continue with SciDAC codes
GYRO Execution Parameters

• Three benchmark problems
  – B1-std: 16n processors, 500 timesteps
  – B2-cy: 16n processors, 1000 timesteps
  – B3-gtc: 64n processors, 100 timesteps (very large)
• Test different methods to evaluate nonlinear terms
  – Direct method
  – FFT (“nl2” for B1 and B2, “nl1” for B3)
• Task affinity enabled/disabled (p690 only)
• Memory affinity enabled/disabled (p690 only)
• Filesystem location (Cray X1 only)
Primary Evaluation Machines

• Phoenix (ORNL – Cray X1)
  – 512 multi-streaming vector processors
• Ram (ORNL – SGI Altix (1.5 GHz Itanium2))
  – 256 total processors
• Cheetah (ORNL – p690 cluster (1.3 GHz, HPS))
  – 864 total processors on 27 compute nodes
• Seaborg (NERSC – IBM SP3)
  – 6080 total processors on 380 compute nodes
• TeraGrid
  – ~7,738 total processors on 15 machines at 9 sites
Region (Events) of Interest








• Total program is measured, plus specific code regions
  – NL: nonlinear advance
  – NL_tr*: transposes before/after nonlinear advance
  – Coll: collisions
  – Coll_tr*: transposes before/after main collision routine
  – Lin_RHS: compute right-hand side of the electron and ion GKEs (GyroKinetic (Vlasov) Equations)
  – Field: explicit or implicit advance of fields and solution of explicit Maxwell equations
  – I/O, extras
  – Communication
Data Collected Thus Far…

• User timer data
  – Self-instrumentation in the GYRO application
  – Outputs aggregate data per N timesteps
    ◦ N = 50 (B1, B3)
    ◦ N = 125 (B2)
• HPM (Hardware Performance Monitor) data
  – IBM platform (p690) only
• MPICL profiling/tracing
  – Cray X1 and IBM p690
• TAU (all platforms, profiling/tracing, in progress)
• Data processed by hand into Excel spreadsheets
PerfExplorer Analysis of Self-Instrumented Data

• PerfExplorer
  – Focus on comparative analysis
  – Apply to PERC tool evaluation study
• Look at user timer data
  – Aggregate data only: no per-process data, so process clustering analysis is not applicable
  – Timings output every N timesteps: some phase analysis possible
• Goal
  – Recreate manually generated performance reports
Comparative Analysis

• Supported analysis
  – Timesteps per second
  – Relative speedup and efficiency (see the sketch after this list)
    ◦ For entire application (compare machines, parameters, etc.)
    ◦ For all events (on one machine, one set of parameters)
    ◦ For one event (compare machines, parameters, etc.)
  – Fraction of total runtime for one group of events
  – Runtime breakdown (as a percentage)
• Initial analysis implemented as scalability study
• Future analysis
  – Arbitrary organization
  – Parametric studies
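
A minimal sketch of the relative speedup and efficiency computation referenced above (runtimes are invented placeholders):

    # Relative speedup/efficiency against the smallest run, as used in
    # the scalability study. Timings below are invented examples.
    procs   = [16, 32, 64, 128]
    seconds = [1000.0, 520.0, 280.0, 160.0]     # total runtime per run

    base_p, base_t = procs[0], seconds[0]
    for p, t in zip(procs, seconds):
        speedup    = base_t / t                 # relative to the base run
        efficiency = speedup * base_p / p       # 1.0 means ideal scaling
        print(f"{p:4d} procs: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")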
PerfExplorer Interface
• Experiment metadata
• Select experiments and trials of interest
• Data organized in application, experiment, trial structure (will allow arbitrary organization in future)
PerfExplorer Interface
• Select analysis
Timesteps per Second




• Cray X1 is the fastest to solution in all 3 tests
• FFT (nl2) improves time for B3-gtc only
• TeraGrid faster than p690 for B1-std?
• Plots generated automatically
[Plots: B1-std, B2-cy, B3-gtc; TeraGrid highlighted]
Relative Efficiency (B1-std)

• By experiment (B1-std)
  – Total runtime (Cheetah (red))
• By event for one experiment
  – Coll_tr (blue) is significant
• By experiment for one event
  – Shows how Coll_tr behaves for all experiments
[Charts annotated: Cheetah, Coll_tr, 16-processor base case]
Relative Speedup (B2-cy)

• By experiment (B2-cy)
  – Total runtime (X1 (blue))
• By event for one experiment
  – NL_tr (orange) is significant
• By experiment for one event
  – Shows how NL_tr behaves for all experiments
Fraction of Total Runtime (Communication)


• IBM SP3 (cyan) has the highest fraction of total time spent in communication for all three benchmarks
• Cray X1 has the lowest fraction in communication
[Charts: B1-std, B2-cy, B3-gtc]
Runtime Breakdown on IBM SP3



• Communication grows as a percentage of total as the application scales (colors match in graphs)
• Both Coll_tr (blue) and NL_tr (orange) scale poorly
• I/O (green) scales poorly, but its percentage of total runtime is small
Phase Analysis

• Breakdown by phase shows variability from beginning of application to final solution
  – Relative efficiency and runtime breakdown
• Iteration 6 (cyan) has a big drop in efficiency at 128 processors
• Greater variability at higher processor counts
Clustering Analysis



• “Scalable Analysis Techniques for Microprocessor Performance Counter Metrics,” Ahn and Vetter, SC2002
• Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
• Cluster analysis and F-ratio (see the sketch after this list)
  – Agglomerative hierarchical method: dendrogram identified groupings of master and slave threads in sPPM
  – K-means clustering and F-ratio: differences between master and slave threads related to communication and management
• Factor analysis
  – Shows highly correlated metrics fall into peer groups
• Combined techniques (applied recursively) lead to observations of application behavior hard to identify otherwise
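
A hedged sketch of those two clustering steps on per-thread metric vectors, using SciPy (synthetic data; this illustrates the technique, not Ahn and Vetter's actual pipeline):

    # Sketch: hierarchical clustering (dendrogram) then k-means on
    # per-thread performance-counter vectors. Data is synthetic:
    # rows = threads, columns = PAPI metrics.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.cluster.vq import kmeans2

    rng = np.random.default_rng(0)
    masters = rng.normal(loc=1.0, scale=0.05, size=(8, 4))    # master-like threads
    workers = rng.normal(loc=0.5, scale=0.05, size=(24, 4))   # worker-like threads
    X = np.vstack([masters, workers])

    # Agglomerative hierarchical clustering; cut the tree into 2 groups
    tree = linkage(X, method="average")
    hier_labels = fcluster(tree, t=2, criterion="maxclust")

    # K-means (k=2) should recover the same master/worker split
    centroids, km_labels = kmeans2(X, 2, minit="++")
    print(hier_labels)
    print(km_labels)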
Similarity Analysis


• Can we recreate Ahn and Vetter’s results?
• Apply techniques from the phase analysis (Sherwood)
  – Threads of execution can be compared for similarity
  – Threads with abnormal behavior show up as less similar
• Each thread is represented as a vector (V) of dimension n
  – n is the number of functions in the application
  – V = [f1, f2, …, fn]
  – Each value is the percentage of time spent in that function (represents the event mix), normalized from 0.0 to 1.0
• Distance calculated between the vectors U and V:

      ManhattanDistance(U, V) = Σ |u_i − v_i|  (sum over i = 1 … n)
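
A minimal sketch of that distance in Python (the thread vectors are invented placeholders):

    # Manhattan (L1) distance between two thread profiles, each a
    # normalized vector of per-function time fractions.
    def manhattan(u, v):
        assert len(u) == len(v)
        return sum(abs(ui - vi) for ui, vi in zip(u, v))

    thread_a = [0.50, 0.30, 0.15, 0.05]   # fractions sum to 1.0
    thread_b = [0.48, 0.31, 0.16, 0.05]   # similar thread
    thread_c = [0.10, 0.20, 0.05, 0.65]   # abnormal thread

    print(manhattan(thread_a, thread_b))  # 0.04 -> very similar
    print(manhattan(thread_a, thread_c))  # 1.20 -> dissimilar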
sPPM on Blue Horizon (64x4, OpenMP+MPI)
• TAU profiles
• 10 events
• PerfDMF
• threads 32-47
sPPM on MCR (total instructions, 16x2)
• TAU/PerfDMF
• 120 events
• master (even)
• worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
• TAU profiles
• PerfDMF
• master/worker
• higher/lower
Same result as Ahn/Vetter
sPPM on Frost (PAPI_FP_INS, 256 threads)


• Fewer than half of the threads of execution fit on the screen at one time
• Three groups are obvious:
  – Lower-ranking threads
  – One unique thread
  – Higher-ranking threads (3% more FP)
• Finding subtle differences is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)

• Dendrogram shows 5 natural clusters:
  – Unique thread
  – High-ranking master threads
  – Low-ranking master threads
  – High-ranking worker threads
  – Low-ranking worker threads
• TAU profiles
• PerfDMF
• R direct access to DM
• R routine applied to threads
sPPM on MCR (PAPI_FP_INS, 16x2 threads)
[Chart: master threads vs. slave threads]
sPPM on Frost (PAPI_FP_INS, 256 threads)

• After K-means clustering into 5 clusters
  – Similar clusters are formed (seeded with group means)
  – Each cluster’s performance characteristics analyzed
  – Dimensionality reduction (256 threads to 5 clusters!)
[Chart: per-cluster breakdown over events SPPM, INTERF, DIFUZE, DINTERF, and Barrier [OpenMP:runhyd3.F <604,0>]; cluster sizes 119, 10, 1, 6, 120]
Current and Future Work

• ParaProf
  – Developing 3D performance displays
• PerfDMF
  – Adding new database backends and distributed support
  – Building support for user-created tables
• PerfExplorer
  – Extending comparative and clustering analysis
  – Adding new data mining capabilities
  – Building in scripting support
• Performance regression testing tool (PerfRegress)
• Integrate in Eclipse Parallel Tool Project (PTP)
Concluding Discussion
• Performance tools must be used effectively
  – More intelligent performance systems for productive use
  – Evolve to application-specific performance technology
  – Deal with scale by “full range” performance exploration
  – Autonomic and integrated tools
  – Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  – But they must be more automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
Support Acknowledgements


• Department of Energy (DOE)
  – Office of Science contracts
  – University of Utah ASCI Level 1 sub-contract
  – ASC/NNSA Level 3 contract
• NSF
  – High-End Computing Grant
• Research Centre Jülich
  – John von Neumann Institute
  – Dr. Bernd Mohr
• Los Alamos National Laboratory