Computational Informatics for Brain Electromagnetic Research
TAU: A Framework for Parallel
Performance Analysis
Allen D. Malony
[email protected]
ParaDucks Research Group
Computer & Information Science Department
Computational Science Institute
University of Oregon
Outline
Goals and challenges
Targeted research areas
TAU (Tuning and Analysis Utilities)
computation model, architecture, toolkit framework
performance system technology
examples of TAU use
Tools associated with TAU
PDT (Program Database Toolkit)
distributed runtime monitoring
Future plans
Conclusions
July 7, 2015
Ptools Annual Meeting
Goal and Challenges
Create robust (performance) technology for the
analysis and tuning of parallel software and systems
Challenges
different scalable computing platforms
different programming languages and systems
common, portable framework for analysis
extensible, retargetable tool technology
complex set of requirements
Targeted Research Areas
Performance analysis for scalable parallel systems
targeting multiple programming and system levels
and the mapping between levels
Program code analysis for multiple languages
enabling development of new source-based tools
Integration and interoperation support for building
analysis tool frameworks and environments
Runtime tool interaction for dynamic applications
TAU (Tuning and Analysis Utilities)
Performance analysis framework for scalable parallel
and distributed high-performance computing
Target a general parallel computation model
computer nodes
shared address space contexts
threads of execution
multi-level parallelism
Network
Integrated toolkit for performance instrumentation,
measurement, analysis, and visualization
portable performance profiling/tracing facility
open software approach
TAU Architecture
TAU Instrumentation
Flexible, multiple instrumentation mechanisms
source code
manual
automatic using PDT (tau_instrumentor)
object code
pre-instrumented libraries
statically linked: MPI wrapper library using the MPI
Profiling Interface (libTauMpi.a)
dynamically linked: Java instrumentation using JVMPI
and TAU shared object dynamically loaded in VM
executable code
dynamic instrumentation using DyninstAPI (tau_run)
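The wrapper-library mechanism listed above (an instrumented entry point that measures a call and then forwards it to the real routine) can be sketched in a few lines of Python. This is only an illustration of the interposition idea, not TAU or MPI code: `PROFILE`, `profiled`, `real_send`, and `send` are hypothetical names introduced for the example.

```python
import functools
import time
from collections import defaultdict

# Hypothetical profile store: routine name -> [call count, inclusive seconds].
PROFILE = defaultdict(lambda: [0, 0.0])


def profiled(func):
    """Wrap func so each call is counted and timed, then forwarded unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = PROFILE[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


def real_send(buf, dest):
    """Stand-in for a library routine; returns the number of bytes 'sent'."""
    return len(buf)


# Interpose: callers keep using the name `send`, but reach the wrapper first.
send = profiled(real_send)

sent = send(b"hello", dest=1)
```

The application code is unchanged; only the binding of the public name differs, which is the property a pre-instrumented wrapper library relies on.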
TAU Instrumentation (continued)
Common target measurement interface (TAU API)
C++ (object-based) instrumentation
macro-based, using constructor/destructor techniques
functions, classes, and templates
uniquely identify functions and templates
by name and type signature (name registration)
static object creates performance entry
dynamic object receives static object pointer
runtime type identification for template instantiations
with C and Fortran instrumentation variants
Instrumentation optimization
TAU Measurement
Performance information
high resolution timer library (real-time clock)
generalized software counter library
hardware performance counters
PCL (Performance Counter Library) (ZAM, Germany)
PAPI (Performance API) (UTK, Ptools)
consistent, portable API
Organization
node, context, thread levels
profile groups for collective events (runtime selective)
mapping between software levels
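One way to picture the organization above is a measurement store keyed by node, context, and thread, with profile groups that can be switched on or off at run time. The following is a minimal sketch, assuming a simple counting store; the `Measurement` class and its methods are hypothetical and do not reflect TAU's internal data structures.

```python
class Measurement:
    """Hypothetical store of event counts organized by node/context/thread."""

    def __init__(self, enabled_groups):
        self.enabled = set(enabled_groups)   # profile groups selected at run time
        self.counts = {}                     # (node, context, thread, event) -> count

    def record(self, node, context, thread, group, event):
        """Count the event only if its profile group is currently enabled."""
        if group not in self.enabled:
            return False
        key = (node, context, thread, event)
        self.counts[key] = self.counts.get(key, 0) + 1
        return True


m = Measurement(enabled_groups={"io"})
kept = m.record(0, 0, 0, group="io", event="read")
dropped = m.record(0, 0, 1, group="mpi", event="send")
```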
TAU Measurement (continued)
Profiling
function-level, block-level, statement-level
supports user-defined events
TAU profile (function) database (PD)
function callstack
hardware counts instead of time
Tracing
profile-level events
interprocess communication events
timestamp synchronization
User-controlled configuration (configure)
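The difference between the two measurement modes can be shown on one stream of enter/exit events: profiling aggregates per routine using a callstack, while tracing keeps every timestamped event. This is a minimal sketch with a hypothetical `run` function and made-up event tuples; it is not the TAU measurement library.

```python
from collections import defaultdict


def run(events):
    """events: list of (timestamp, kind, name), kind being 'enter' or 'exit'."""
    trace = []                                   # tracing: every event, in order
    profile = defaultdict(lambda: {"calls": 0, "inclusive": 0.0})  # profiling
    stack = []                                   # callstack of (name, entry time)
    for ts, kind, name in events:
        trace.append((ts, kind, name))
        if kind == "enter":
            stack.append((name, ts))
        else:
            entered, t0 = stack.pop()
            if entered != name:
                raise ValueError("unbalanced enter/exit events")
            profile[name]["calls"] += 1
            profile[name]["inclusive"] += ts - t0
    return dict(profile), trace


events = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (4.0, "exit", "solve"),
    (5.0, "exit", "main"),
]
profile, trace = run(events)
```

The profile stays the same size however long the program runs, while the trace grows with every event, which is why the two are offered as separate, configurable options.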
Timing of Multi-threaded Applications
Capture timing information on a per-thread basis
Two alternatives
wall clock time
works on all systems
user-level measurement
OS-maintained CPU time (e.g., Solaris, Linux)
thread virtual time measurement
TAU supports both alternatives
CPUTIME module profiles user+system time
% configure -pthread -CPUTIME
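The two timing alternatives can be compared directly with Python's standard library, which exposes both a wall clock (`time.perf_counter`) and an OS-maintained per-thread CPU clock (`time.thread_time`, Python 3.7+). This is a sketch of the distinction only and does not use TAU; `timed_worker` is a hypothetical name.

```python
import threading
import time


def timed_worker(results, index):
    wall0 = time.perf_counter()     # wall clock time
    cpu0 = time.thread_time()       # CPU time charged to this thread by the OS
    total = sum(i * i for i in range(50_000))   # some CPU work
    time.sleep(0.05)                # blocked: wall time advances, CPU time barely does
    results[index] = {
        "wall": time.perf_counter() - wall0,
        "cpu": time.thread_time() - cpu0,
        "total": total,
    }


results = [None, None]
threads = [threading.Thread(target=timed_worker, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread reports a wall time that includes the time it spent blocked, while its CPU time covers only the work it actually executed.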
TAU Analysis
Profile analysis
pprof
parallel profiler with text-based display
racy
graphical interface to pprof
Trace analysis
trace merging and clock adjustment (if necessary)
trace format conversion (ALOG, SDDF, PV, Vampir)
Vampir
trace analysis and visualization tool (Pallas)
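Trace merging with clock adjustment can be sketched as a timestamp-ordered merge of per-node event streams after a per-node offset is applied. The sketch assumes a constant offset per node, which is a simplification chosen for the example; `merge_traces` and the sample data are hypothetical and do not reflect TAU's trace format.

```python
import heapq


def merge_traces(traces, clock_offsets):
    """Merge per-node traces into one globally ordered trace.

    traces: {node: [(local_timestamp, event), ...]}, each list sorted by time.
    clock_offsets: {node: seconds to add to that node's local timestamps}.
    """
    def adjusted(node):
        offset = clock_offsets.get(node, 0.0)
        for ts, event in traces[node]:
            yield (ts + offset, node, event)

    # A constant offset keeps each stream sorted, so a k-way merge suffices.
    return list(heapq.merge(*(adjusted(n) for n in sorted(traces))))


traces = {
    0: [(1.0, "send"), (3.0, "recv")],
    1: [(0.5, "recv"), (2.5, "send")],   # node 1's clock runs 1.0 s behind
}
merged = merge_traces(traces, clock_offsets={1: 1.0})
```

After adjustment, the receive on node 1 falls after the matching send on node 0, as a consistent global timeline requires.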
TAU Status
Usage
platforms
IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP,
Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster
languages
C, C++, Fortran 77/90, HPF, pC++, HPC++, Java
communication libraries
MPI, PVM, Nexus, Tulip, ACLMPL
thread libraries
pthreads, Tulip, SMARTS, Java, Windows
compilers
KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray
TAU Status (continued)
application libraries
Blitz++, A++/P++, ACLVIS, PAWS
application frameworks
POOMA, POOMA-2, MC++, Conejo, PaRP
other projects
ACPC, University of Vienna: Aurora
UC Berkeley (Culler): Millennium, sensitivity analysis
KAI and Pallas
TAU profiling and tracing toolkit (Version 2.7)
LANL ACL Fall 1999 CD-ROM distributed at SC'99
Extensive 70-page TAU User’s Guide
http://www.acl.lanl.gov/tau
TAU Examples
Instrumentation
C++ template profiling (PETE, Blitz++)
Java and MPI
Measurement
PAPI
mapping of asynchronous execution (SMARTS)
hybrid execution (Opus/HPF)
Analysis
SMARTS scheduling
C++ Template Instrumentation (Blitz++, PETE)
High-level objects
array classes
templates
Optimizations
array processing
expressions (PETE)
Array expressions
Relate performance data to high-level statement
Complexity of template evaluation
Standard Template Instrumentation Difficulties
Instantiated templates result in mangled identifiers
Standard profiling techniques and tools are deficient
integrated with proprietary compilers
specific systems platforms and programming models
Uninterpretable routine names
TAU Template Instrumentation and Profiling
Profile of expression types
Graphical pprof
Performance data presented with respect to high-level array expression types
Parallel Java Performance Instrumentation
Multi-language applications (Java, C, C++, Fortran)
Hybrid execution models (Java threads, MPI)
Java Virtual Machine Profiler Interface (JVMPI)
Java Native Interface (JNI)
event instrumentation in JVM
profiler agent (libTAU.so) fields events
invoke JVMPI control routines to control Java threads
and access thread information
MPI profiling interface
“Performance Tools for Parallel Java Environments,”
Java Workshop, ICS 2000, May 2000.
TAU Java Instrumentation Architecture
[Figure: Java program and TAU package; JVMPI (thread API, event notification); JNI; TAU; mpiJava package; MPI profiling interface; TAU wrapper; native MPI library; profile DB]
Parallel Java Game of Life
mpiJava testcase
4 nodes, 28 threads
Node process grouping
Thread message pairing
Vampir display
Multi-level event grouping
TAU and PAPI: NAS Parallel LU Benchmark
SGI Power Onyx (4 processors, R10K), MPI
Floating point operations
Percentage profile
Cross-node full / routine profiles
Full FP profile for each node
TAU and PAPI: Matrix Multiply
Data cache miss comparison,
“regular” vs. “strip-mining” execution
512x512
32 KB (P), 2 MB (S)
Regular causes 4.5 times more misses
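The slide does not show the benchmark code, so the following is only a sketch of what a "regular" and a "strip-mined" matrix multiply typically look like. Function names and the strip size are assumptions made for the example. Python will not reproduce the cache behavior measured with PAPI; the point is the loop structure, and the fact that both variants compute the same product.

```python
def matmul_regular(a, b, n):
    """Plain triple loop: the inner loop walks down a full column of b."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_strip_mined(a, b, n, strip):
    """Same arithmetic, but j and k are processed in strips of `strip` elements
    so that a small block of b is reused before moving on."""
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, strip):
        for kk in range(0, n, strip):
            for i in range(n):
                for j in range(jj, min(jj + strip, n)):
                    s = c[i][j]
                    for k in range(kk, min(kk + strip, n)):
                        s += a[i][k] * b[k][j]
                    c[i][j] = s
    return c


n = 6
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
c_regular = matmul_regular(a, b, n)
c_strip = matmul_strip_mined(a, b, n, strip=4)   # strip need not divide n
```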
Asynchronous Performance Analysis (SMARTS)
Scalable Multithreaded Asynchronous Runtime System
TAU measurement of asynchronous parallel execution
user-level threads, light-weight virtual processors
macro-dataflow, asynchronous execution interleaving
iterates from data-parallel statements
integrated with POOMA II
utilized the TAU mapping API
associate iterate performance with data parallel statement
evaluate different scheduling policies
“SMARTS: Exploiting Temporal Locality & Parallelism
through Vertical Execution,” ICS '99, August 1999.
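The idea of associating iterate performance with the originating data-parallel statement can be sketched as follows. Each iterate carries a reference to the statement that generated it, and the measurement is attributed either to that statement (with mapping) or to the generic object that executes all iterates (without mapping). This is an illustration only, not the TAU mapping API: `Iterate`, `run_iterates`, and `FakeClock` are hypothetical, and a deterministic clock replaces real timing.

```python
from collections import defaultdict


class FakeClock:
    """Deterministic stand-in for a timer so the example is reproducible."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Iterate:
    """Hypothetical unit of work generated from one data-parallel statement."""

    def __init__(self, statement, work):
        self.statement = statement   # high-level statement that produced it
        self.work = work             # callable that performs the work


def run_iterates(iterates, clock, use_mapping):
    times = defaultdict(float)
    for it in iterates:
        t0 = clock()
        it.work()
        elapsed = clock() - t0
        # Without mapping, everything is charged to the executing kernel object.
        key = it.statement if use_mapping else "ExpressionKernel"
        times[key] += elapsed
    return dict(times)


clock = FakeClock()
iterates = [
    Iterate("A = B + C", lambda: clock.advance(2.0)),
    Iterate("D = A * A", lambda: clock.advance(3.0)),
    Iterate("A = B + C", lambda: clock.advance(2.0)),
]
unmapped = run_iterates(iterates, clock, use_mapping=False)
mapped = run_iterates(iterates, clock, use_mapping=True)
```

Without mapping the iterates are lumped into a single entry; with mapping the same measurements are split by statement.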
TAU Mapping of Asynchronous Execution
Without mapping
Two threads executing
With mapping
POOMA / SMARTS
With and without mapping (Thread 0)
Without mapping
Thread 0 blocks waiting for iterates
Iterates get lumped together
With mapping
Iterates distinguished
With and without mapping (Thread 1)
Without mapping
Array initialization performance lumped
Performance associated with ExpressionKernel object
With mapping
Iterate performance mapped to array statement
Array initialization performance correctly separated
TAU and Hybrid Execution in Opus/HPF
Fortran 77, Fortran 90, HPF
Vienna Fortran Compiling System
Opus / HPF
combined data (HPF) and task (Opus) parallelism
HPF compiler produces Fortran 90 modules
processes interoperate using Opus runtime system
producer / consumer model
MPI and pthreads
performance influence at multiple software levels
TAU Profiling of Opus/HPF Application
Multiple producers
Multiple consumers
Parallelism View
TAU Profiling of SMARTS
Iteration scheduling for two array expressions
SMARTS Tracing (SOR) – Vampir Visualization
SCVE scheduler used in Red/Black SOR running on
32 processors of SGI Origin 2000
Asynchronous, overlapped parallelism
Program Database Toolkit (PDT)
Program code analysis framework for developing
source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing, database
creation, and database query
commercial grade front end parsers
portable IL analyzer, database format, and access API
open software approach for tool development
Target and integrate multiple source languages
http://www.acl.lanl.gov/pdtoolkit
PDT Architecture and Tools
PDT Summary
Program Database Toolkit (Version 1.1)
LANL ACL Fall 1999 CD-ROM distributed at SC'99
EDG C++ Front End (Version 2.41.2)
C++ IL Analyzer and DUCTAPE library
tools: pdbmerge, pdbconv, pdbtree, pdbhtml
standard C++ system header files (KAI KCC 3.4c)
Fortran 90 IL Analyzer in progress
Automated TAU performance instrumentation
Program analysis support for SILOON (ACL CD)
“A Tool Framework for Static and Dynamic Analysis
of Object-Oriented Software,” submitted to SC ’00.
Distributed Monitoring Framework
Extend usability of TAU performance analysis
Access TAU performance data during execution
Framework model
each application context is a performance data server
monitor agent thread is created within each context
client processes attach to agents and request data
server thread synchronization for data consistency
pull mode of interaction
Distributed TAU performance data space
“A Runtime Monitoring Framework for the TAU
Profiling System,” ISCOPE ’99, Nov. 1999.
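The framework model above (an agent thread inside each context, clients that pull data on request, and synchronization so that a snapshot is consistent) can be sketched in a single process with Python threads and queues. This is a minimal sketch under stated assumptions: the real framework serves separate application contexts, and `Context`, `record_call`, and `pull` are hypothetical names.

```python
import queue
import threading


class Context:
    """An application context that also acts as a performance data server."""

    def __init__(self, name):
        self.name = name
        self.profile = {}                  # routine -> call count
        self.lock = threading.Lock()       # keeps snapshots consistent with updates
        self.requests = queue.Queue()      # client requests arrive here
        self.agent = threading.Thread(target=self._serve, daemon=True)
        self.agent.start()                 # monitor agent thread

    def record_call(self, routine):
        with self.lock:
            self.profile[routine] = self.profile.get(routine, 0) + 1

    def _serve(self):
        while True:
            reply = self.requests.get()
            if reply is None:              # shutdown signal
                break
            with self.lock:
                snapshot = dict(self.profile)
            reply.put((self.name, snapshot))

    def shutdown(self):
        self.requests.put(None)
        self.agent.join()


def pull(contexts):
    """Client side (pull mode): ask every agent for its current data."""
    reply = queue.Queue()
    for ctx in contexts:
        ctx.requests.put(reply)
    return dict(reply.get() for _ in contexts)


contexts = [Context("ctx0"), Context("ctx1")]
contexts[0].record_call("solve")
contexts[0].record_call("solve")
contexts[1].record_call("solve")
contexts[1].record_call("exchange")
data = pull(contexts)
for ctx in contexts:
    ctx.shutdown()
```

Data moves only when a client asks for it, so an application that nobody is watching pays only for the idle agent thread.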
TAU Distributed Monitor Architecture
Each context has a monitor agent
Client in separate thread directs agent
Pull model of interaction
Initial HPC++ implementation
TAU profile database
Java Implementation of TAU Monitor
Motivations
more portable monitor middleware system (RMI)
more flexible and programmable server interface (JNI)
more robust client development (EJB, JDBC, Swing)
Future Plans
TAU
platforms: SGI Itanium, Sun Starfire, IBM Linux, ...
languages: Java (Java Grande), OpenMP
instrument: automatic (F90, Java), Dyninst
measurement: hardware counter, support PAPI
displays: “beyond bargraphs” performance views
performance database and technology
support for multiple runs
open API for analysis tool development
PDT
complete F90 and Java IL Analyzer
source browsers: function, class, template
tools for aiding in data marshalling and translation
Future Plans (continued)
Distributed monitoring framework
application and system monitoring
ACL Supermon and SGI Performance Co-Pilot
scalable SMP clusters and distributed systems
Performance evaluation
performance monitoring clients
numerical libraries and frameworks
scalable runtime systems
ASCI application developers (benchmark codes)
Investigate performance issues in Linux kernel
Investigate integration with CCA
Conclusions
Complex parallel computing environments require
robust program analysis tools
TAU offers a robust performance technology
framework for complex parallel computing systems
portable, cross-platform, multi-level, integrated
able to bridge and reuse existing technology
technology savvy
flexible instrumentation and measurement
extendable profile and trace performance analysis
integration with other performance technology
Opportunities exist for open performance technology
Open Performance Technology (OPT)
Performance problem is complex
History of incompatible and competing tools
instrumentation / measurement technology reinvention
lack of common, reusable software foundations
Need “value added” (open) approach
diverse platforms, software development, applications
things evolve
technology for high-level performance tool development
layered performance tool architecture
portable, flexible, programmable, integrative technology
Opportunity for Industry/National Labs/PACI sites