Computational Informatics for Brain Electromagnetic Research

TAU: A Framework for Parallel Performance Analysis
Allen D. Malony
[email protected]
ParaDucks Research Group
Computer & Information Science Department
Computational Science Institute
University of Oregon
Outline
Goals and challenges
Targeted research areas
TAU (Tuning and Analysis Utilities)
  computation model, architecture, toolkit framework
  performance system technology
  examples of TAU use
Tools associated with TAU
  PDT (Program Database Toolkit)
  distributed runtime monitoring
Future plans
Conclusions
July 7, 2015
Ptools Annual Meeting
Goal and Challenges
Create robust (performance) technology for the analysis and tuning of parallel software and systems
Challenges
  different scalable computing platforms
  different programming languages and systems
  common, portable framework for analysis
  extensible, retargetable tool technology
  complex set of requirements

Targeted Research Areas
Performance analysis for scalable parallel systems, targeting multiple programming and system levels and the mapping between levels
Program code analysis for multiple languages, enabling development of new source-based tools
Integration and interoperation support for building analysis tool frameworks and environments
Runtime tool interaction for dynamic applications

TAU (Tuning and Analysis Utilities)
Performance analysis framework for scalable parallel and distributed high-performance computing
Target a general parallel computation model
  computer nodes
  shared address space contexts
  threads of execution
  multi-level parallelism
[Diagram label: Network]
Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  portable performance profiling/tracing facility
  open software approach

TAU Architecture
TAU Instrumentation
Flexible, multiple instrumentation mechanisms
  source code
    manual
    automatic using PDT (tau_instrumentor)
  object code
    pre-instrumented libraries
    statically linked: MPI wrapper library using the MPI Profiling Interface (libTauMpi.a)
    dynamically linked: Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM
  executable code
    dynamic instrumentation using DyninstAPI (tau_run)

TAU Instrumentation (continued)
Common target measurement interface (TAU API)
C++ (object-based) instrumentation
  macro-based, using constructor/destructor techniques
  functions, classes, and templates
  uniquely identify functions and templates
    name and type signature (name registration)
    static object creates performance entry
    dynamic object receives static object pointer
    runtime type identification for template instantiations
  with C and Fortran instrumentation variants
Instrumentation optimization

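The static/dynamic object split listed above can be sketched as follows. TAU itself does this with C++ macros, constructors, and destructors; this Python sketch swaps in a context manager as the closest scoped equivalent. The names FunctionInfo, Profiler, and scale are hypothetical and are not part of the TAU API.

```python
import time


class FunctionInfo:
    """Static part: one performance entry per (name, type signature)."""
    registry = {}

    def __init__(self, name, signature):
        self.name, self.signature = name, signature
        self.calls = 0
        self.inclusive = 0.0

    @classmethod
    def lookup(cls, name, signature):
        # Name registration: the entry is created once and then reused.
        key = (name, signature)
        if key not in cls.registry:
            cls.registry[key] = cls(name, signature)
        return cls.registry[key]


class Profiler:
    """Dynamic part: created on each entry, holds a reference to the static entry."""

    def __init__(self, info, clock=time.perf_counter):
        self.info, self.clock = info, clock

    def __enter__(self):            # plays the role of the constructor
        self.start = self.clock()
        return self

    def __exit__(self, *exc):       # plays the role of the destructor
        self.info.calls += 1
        self.info.inclusive += self.clock() - self.start
        return False


def scale(values, factor):
    # Hypothetical instrumented routine.
    with Profiler(FunctionInfo.lookup("scale", "list(list, float)")):
        return [v * factor for v in values]
```

Calling scale repeatedly adds to a single registry entry, which is the point of separating the long-lived entry from the short-lived timer object.
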
TAU Measurement
Performance information
  high resolution timer library (real-time clock)
  generalized software counter library
  hardware performance counters
    PCL (Performance Counter Library) (ZAM, Germany)
    PAPI (Performance API) (UTK, Ptools)
    consistent, portable API
Organization
  node, context, thread levels
  profile groups for collective events (runtime selective)
  mapping between software levels

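One way to picture the organization above is a store keyed by (node, context, thread), with profile groups that can be switched on and off at runtime. This is a minimal sketch under that assumption; ProfileStore and its methods are hypothetical names, not TAU's data structures.

```python
from collections import defaultdict


class ProfileStore:
    def __init__(self):
        self.enabled_groups = set()
        # (node, context, thread) -> event name -> accumulated value
        self.data = defaultdict(lambda: defaultdict(float))

    def enable(self, group):
        self.enabled_groups.add(group)

    def disable(self, group):
        self.enabled_groups.discard(group)

    def record(self, node, context, thread, group, event, value):
        # Events in a disabled group are skipped (runtime-selective profiling).
        if group not in self.enabled_groups:
            return False
        self.data[(node, context, thread)][event] += value
        return True
```

With only the "comm" group enabled, a record call for an "io" event returns False and leaves the store unchanged.
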
TAU Measurement (continued)
Profiling
  function-level, block-level, statement-level
  supports user-defined events
  TAU profile (function) database (PD)
  function callstack
  hardware counts instead of time
Tracing
  profile-level events
  interprocess communication events
  timestamp synchronization
User-controlled configuration (configure)

Timing of Multi-threaded Applications
Capture timing information on a per-thread basis
Two alternatives
  wall clock time
    works on all systems
    user-level measurement
  OS-maintained CPU time (e.g., Solaris, Linux)
    thread virtual time measurement
TAU supports both alternatives
  CPUTIME module profiles user+system time
  % configure -pthread -CPUTIME

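The difference between the two alternatives shows up whenever a thread waits. The sketch below uses Python's standard clocks as stand-ins: time.perf_counter for wall clock time and time.thread_time for the OS-maintained user+system CPU time of the calling thread. The measure helper is a hypothetical name, and none of this is TAU's CPUTIME module.

```python
import time


def measure(work):
    """Return (wall clock seconds, per-thread CPU seconds) spent in work()."""
    wall0 = time.perf_counter()
    cpu0 = time.thread_time()
    work()
    return time.perf_counter() - wall0, time.thread_time() - cpu0


# A sleeping thread accumulates wall clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.05))
print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```
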
TAU Analysis
Profile analysis
  pprof
    parallel profiler with text-based display
  racy
    graphical interface to pprof
Trace analysis
  trace merging and clock adjustment (if necessary)
  trace format conversion (ALOG, SDDF, PV, Vampir)
  Vampir
    trace analysis and visualization tool (Pallas)

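Trace merging with clock adjustment can be sketched as a k-way merge over per-process event lists. This sketch assumes each trace is already sorted locally and that a constant per-process offset is enough to align the clocks; the slides do not say how TAU computes the adjustment. merge_traces is a hypothetical name.

```python
import heapq


def merge_traces(traces, offsets):
    """Merge per-process traces into one globally ordered trace.

    traces:  {process id: [(timestamp, event), ...]}, each sorted by timestamp
    offsets: {process id: seconds to add to that process's local clock}
    """
    def adjusted(pid):
        for ts, event in traces[pid]:
            yield (ts + offsets.get(pid, 0.0), pid, event)

    return list(heapq.merge(*(adjusted(pid) for pid in sorted(traces))))


traces = {
    0: [(1.0, "send"), (3.0, "exit")],
    1: [(0.5, "recv"), (1.5, "exit")],
}
merged = merge_traces(traces, offsets={1: 1.0})
```

Without the offset, the receive on process 1 would be ordered before the matching send on process 0; with it, the send comes first.
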
TAU Status
Usage
  platforms
    IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster
  languages
    C, C++, Fortran 77/90, HPF, pC++, HPC++, Java
  communication libraries
    MPI, PVM, Nexus, Tulip, ACLMPL
  thread libraries
    pthreads, Tulip, SMARTS, Java, Windows
  compilers
    KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray

TAU Status (continued)
application libraries
  Blitz++, A++/P++, ACLVIS, PAWS
application frameworks
  POOMA, POOMA-2, MC++, Conejo, PaRP
other projects
  ACPC, University of Vienna: Aurora
  UC Berkeley (Culler): Millennium, sensitivity analysis
  KAI and Pallas
TAU profiling and tracing toolkit (Version 2.7)
  LANL ACL Fall 1999 CD-ROM distributed at SC'99
  Extensive 70-page TAU User’s Guide
  http://www.acl.lanl.gov/tau

TAU Examples
Instrumentation
  C++ template profiling (PETE, Blitz++)
  Java and MPI
  PAPI
Measurement
  mapping of asynchronous execution (SMARTS)
  hybrid execution (Opus/HPF)
Analysis
  SMARTS scheduling

C++ Template Instrumentation (Blitz++, PETE)
High-level objects
  array classes
  templates
Optimizations
  array processing
  expressions (PETE)
Relate performance data to high-level statement
Complexity of template evaluation
[Figure label: Array expressions]

Standard Template Instrumentation Difficulties
Instantiated templates result in mangled identifiers
Standard profiling techniques and tools are deficient
  integrated with proprietary compilers
  specific systems platforms and programming models
Uninterpretable routine names

TAU Template Instrumentation and Profiling
Profile of expression types
Graphical pprof
Performance data presented with respect to high-level array expression types

Parallel Java Performance Instrumentation
Multi-language applications (Java, C, C++, Fortran)
Hybrid execution models (Java threads, MPI)
Java Virtual Machine Profiler Interface (JVMPI)
  event instrumentation in JVM
  profiler agent (libTAU.so) fields events
Java Native Interface (JNI)
  invoke JVMPI control routines to control Java threads and access thread information
MPI profiling interface
“Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000.

TAU Java Instrumentation Architecture
[Architecture diagram. Components: Java program; TAU package; mpiJava package; Thread API; event notification; JVMPI; JNI; TAU; MPI profiling interface; TAU wrapper; native MPI library; profile DB]

Parallel Java Game of Life
mpiJava testcase
4 nodes, 28 threads
Node process grouping
Thread message pairing
Vampir display
Multi-level event grouping

TAU and PAPI: NAS Parallel LU Benchmark
SGI Power Onyx (4 processors, R10K), MPI
Floating point operations
Percentage profile
Cross-node full / routine profiles
Full FP profile for each node

TAU and PAPI: Matrix Multiply
Data cache miss comparison, “regular” vs. “strip-mining” execution
512x512
32 KB (P)
2 MB (S)
Regular causes 4.5 times more misses

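The slide does not show the two loop nests it compares, so the following is only a sketch of what "regular" and "strip-mining" execution usually mean for matrix multiply. The function names and the strip width are assumptions. Python lists will not reproduce the cache behavior measured with PAPI; the sketch shows only how the loops are restructured.

```python
def matmul_regular(a, b, n):
    """Plain triple loop: the inner loop walks a whole column of b."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_strip_mined(a, b, n, strip=4):
    """Same arithmetic, with the j and k loops cut into strips so that a small
    block of b is reused across all rows i before moving to the next block."""
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, strip):
        for kk in range(0, n, strip):
            for i in range(n):
                for j in range(jj, min(jj + strip, n)):
                    for k in range(kk, min(kk + strip, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Both versions perform the same multiply-adds in the same k order for each output element, so they return identical results. Only the memory access pattern differs.
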
Asynchronous Performance Analysis (SMARTS)
Scalable Multithreaded Asynchronous Runtime System
  user-level threads, light-weight virtual processors
  macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements
  integrated with POOMA II
TAU measurement of asynchronous parallel execution
  utilized the TAU mapping API
  associate iterate performance with data parallel statement
  evaluate different scheduling policies
“SMARTS: Exploiting Temporal Locality & Parallelism through Vertical Execution,” ICS '99, August 1999.

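The idea of associating iterate performance with the originating data-parallel statement can be sketched as a lookup from iterate to statement at measurement time. This is not the TAU mapping API; MappingProfile and its methods are hypothetical names. The "ExpressionKernel" fallback stands for the unmapped case, where all iterate time lands on the generic kernel object.

```python
from collections import defaultdict


class MappingProfile:
    def __init__(self):
        self.statement_of = {}                 # iterate id -> statement label
        self.time_by_key = defaultdict(float)  # statement label -> seconds

    def map_iterate(self, iterate_id, statement):
        # Recorded when the runtime creates an iterate for a statement.
        self.statement_of[iterate_id] = statement

    def record(self, iterate_id, elapsed):
        # Unmapped iterates are lumped together under the kernel object.
        key = self.statement_of.get(iterate_id, "ExpressionKernel")
        self.time_by_key[key] += elapsed
```

Iterates registered with map_iterate are charged to their own statement, while unregistered ones accumulate under the single "ExpressionKernel" key.
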
TAU Mapping of Asynchronous Execution
Without mapping
Two threads executing
With mapping
POOMA / SMARTS

With and without mapping (Thread 0)
Without mapping
  Thread 0 blocks waiting for iterates
  Iterates get lumped together
With mapping
  Iterates distinguished

With and without mapping (Thread 1)
Without mapping
  Array initialization performance lumped
  Performance associated with ExpressionKernel object
With mapping
  Iterate performance mapped to array statement
  Array initialization performance correctly separated

TAU and Hybrid Execution in Opus/HPF
Fortran 77, Fortran 90, HPF
Vienna Fortran Compiling System
Opus / HPF
  combined data (HPF) and task (Opus) parallelism
  HPF compiler produces Fortran 90 modules
  processes interoperate using Opus runtime system
    producer / consumer model
    MPI and pthreads
  performance influence at multiple software levels

TAU Profiling of Opus/HPF Application
Multiple producers
Multiple consumers
Parallelism View
TAU Profiling of SMARTS
Iteration scheduling for two array expressions

SMARTS Tracing (SOR) – Vampir Visualization
SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000
Asynchronous, overlapped parallelism

Program Database Toolkit (PDT)
Program code analysis framework for developing source-based tools
High-level interface to source code information
Integrated toolkit for source code parsing, database creation, and database query
  commercial grade front end parsers
  portable IL analyzer, database format, and access API
  open software approach for tool development
Target and integrate multiple source languages
http://www.acl.lanl.gov/pdtoolkit

PDT Architecture and Tools
PDT Summary
Program Database Toolkit (Version 1.1)
  LANL ACL Fall 1999 CD-ROM distributed at SC'99
  EDG C++ Front End (Version 2.41.2)
  C++ IL Analyzer and DUCTAPE library
  tools: pdbmerge, pdbconv, pdbtree, pdbhtml
  standard C++ system header files (KAI KCC 3.4c)
  Fortran 90 IL Analyzer in progress
Automated TAU performance instrumentation
Program analysis support for SILOON (ACL CD)
“A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software,” submitted to SC ’00.

Distributed Monitoring Framework
Extend usability of TAU performance analysis
Access TAU performance data during execution
Framework model
  each application context is a performance data server
  monitor agent thread is created within each context
  client processes attach to agents and request data
  server thread synchronization for data consistency
  pull mode of interaction
Distributed TAU performance data space
“A Runtime Monitoring Framework for the TAU Profiling System,” ISCOPE ’99, Nov. 1999.

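The framework model above (an agent thread per context, clients that pull, and synchronization for consistent data) can be sketched in a single process. The real framework serves separate client processes and its wire protocol is not described on the slide, so a queue stands in for the client connection here. MonitorAgent and its methods are hypothetical names.

```python
import queue
import threading


class MonitorAgent(threading.Thread):
    """Agent thread inside one context; answers pull requests with a snapshot."""

    def __init__(self, data, lock):
        super().__init__(daemon=True)
        self.data = data          # profile data shared with application threads
        self.lock = lock          # guards self.data for consistent snapshots
        self.requests = queue.Queue()

    def run(self):
        while True:
            reply = self.requests.get()
            if reply is None:     # stop signal
                break
            with self.lock:       # copy under the lock, never mid-update
                reply.put(dict(self.data))

    def pull(self):
        """Client side: request the current profile and wait for the reply."""
        reply = queue.Queue(maxsize=1)
        self.requests.put(reply)
        return reply.get(timeout=5)

    def stop(self):
        self.requests.put(None)
        self.join(timeout=5)
```

The agent does nothing until a client asks, which is what distinguishes the pull mode from a design in which contexts push updates on their own schedule.
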
TAU Distributed Monitor Architecture
Each context has a monitor agent
Client in separate thread directs agent
Pull model of interaction
Initial HPC++ implementation
[Diagram label: TAU profile database]

Java Implementation of TAU Monitor
Motivations
  more portable monitor middleware system (RMI)
  more flexible and programmable server interface (JNI)
  more robust client development (EJB, JDBC, Swing)

Future Plans
TAU
  platforms: SGI Itanium, Sun Starfire, IBM Linux, ...
  languages: Java (Java Grande), OpenMP
  instrument: automatic (F90, Java), Dyninst
  measurement: hardware counter, support PAPI
  displays: “beyond bargraphs” performance views
  performance database and technology
    support for multiple runs
    open API for analysis tool development
PDT
  complete F90 and Java IL Analyzer
  source browsers: function, class, template
  tools for aiding in data marshalling and translation

Future Plans (continued)
Distributed monitoring framework
  application and system monitoring
    ACL Supermon and SGI Performance Co-Pilot
    scalable SMP clusters and distributed systems
  performance monitoring clients
Performance evaluation
  numerical libraries and frameworks
  scalable runtime systems
  ASCI application developers (benchmark codes)
Investigate performance issues in Linux kernel
Investigate integration with CCA

Conclusions
Complex parallel computing environments require robust program analysis tools
  portable, cross-platform, multi-level, integrated
  able to bridge and reuse existing technology
  technology savvy
TAU offers a robust performance technology framework for complex parallel computing systems
  flexible instrumentation and measurement
  extendable profile and trace performance analysis
  integration with other performance technology
Opportunities exist for open performance technology

Open Performance Technology (OPT)
Performance problem is complex
  diverse platforms, software development, applications
  things evolve
History of incompatible and competing tools
  instrumentation / measurement technology reinvention
  lack of common, reusable software foundations
Need “value added” (open) approach
  technology for high-level performance tool development
  layered performance tool architecture
  portable, flexible, programmable, integrative technology
Opportunity for Industry/National Labs/PACI sites