Performance Analysis, Tools and Optimization
Philip J. Mucci
Kevin S. London
University of Tennessee, Knoxville
ARL MSRC Users’ Group Meeting
September 2, 1998
PET, UT and You
• Training
• Environments
• Benchmarking
• Evaluation and Reviews
• Consulting
• Development
Training
• Courses on Benchmarking, Performance
Optimization, Parallel Tools
• Provides a good mechanism for technology transfer
• Needs and direction develop from interaction with the user community
• Tremendous knowledge base from which to
draw
Environments
• Use of the MSRC environments provides:
– Bug reports to the vendor
– System tuning
– System administrator support
– Analysis of software needs
– Performance evaluation
– Researcher access to advanced hardware
Performance Understanding
• In order to optimize, we must understand:
• Why is our code performing a certain way?
• What can be done about it?
• How well can we do?
• Results in confidence, efficiency and better code development
• Time spent is an investment in the future
Tool Evaluation
Ptools Consortium
• Review of available performance tools, particularly parallel tools
• Regular reports are issued
• Tools that we find useful get presented to
the developers in training or consultation
• Installation, testing and training
• Example: VAMPIR for scalability analysis
Optimization Course
• Course focuses on compiler options, available tools and single-processor performance
• Single-processor performance, especially cache behavior, is the biggest bottleneck for many codes
• Why? Link speeds have increased to within an order of magnitude of memory bandwidth
• Also covers MPI and language-specific issues
Benchmarks
• CacheBench - performance of the
memory hierarchy
• MPBench - performance of core MPI
operations
• BLASBench - performance of dense
numerical kernels
• Intended to provide an orthogonal set of
low-level benchmarks with which we can
parameterize codes
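As a rough illustration of what CacheBench measures (a sketch of the general technique, not CacheBench's actual source), the loop below times a read-modify-write pass over buffers of growing size; measured bandwidth drops as the working set spills out of each cache level. The buffer sizes, repeat count and timer here are assumptions of this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t max_bytes = 8u * 1024 * 1024;      /* sweep up to 8 MB */
    const int repeats = 64;                         /* passes per size */
    double *buf = malloc(max_bytes);
    if (buf == NULL)
        return 1;

    for (size_t bytes = 1024; bytes <= max_bytes; bytes *= 2) {
        size_t n = bytes / sizeof(double);
        for (size_t i = 0; i < n; i++)
            buf[i] = 1.0;                           /* warm the buffer */

        clock_t t0 = clock();                       /* crude timer; fine for a sketch */
        for (int r = 0; r < repeats; r++)
            for (size_t i = 0; i < n; i++)
                buf[i] += 1.0;                      /* one load, one store */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double mbytes = (double)repeats * bytes / (1024.0 * 1024.0);
        printf("%8lu bytes: %7.1f MB/s\n", (unsigned long)bytes, mbytes / sec);
    }
    free(buf);
    return 0;
}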
Cache Performance
• Tuning for caches is difficult without some
understanding of computer architecture
• No way to really know what’s in the cache at any given point in an application
• Factor of 2-4 performance increase is
common
• We are developing a tool to help identify problem regions in the source code, down to a specific reference (see the example below)
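For example (an illustration of the kind of reference such a tool would flag, not an example from the talk), a column-order traversal of a row-major C array misses in cache on nearly every reference, while the row-order version reuses each cache line; the loop interchange alone can approach the factor of 2-4 cited above.

#define N 1024
static double a[N][N];

/* Cache-hostile: the inner loop strides down a column, so each
 * reference in row-major C storage lands on a different cache line. */
double sum_column_order(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Cache-friendly: the inner loop walks along a row, consuming each
 * cache line fully before moving on.  Same arithmetic, same result. */
double sum_row_order(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}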
Cache Simulator
• Profiling the code reveals cache problems
• Automated instrumentation of offending
routines via a GUI or by hand
• Link with simulator library
• Make architecture configuration file
• Addresses are traced and simulated
• Miss locations are recorded and reports are
generated
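Done by hand, the instrumentation might look roughly like this. The cs_ref() call is a hypothetical name invented for this sketch (stubbed here so it compiles standalone); the simulator library's real interface may differ.

#include <stdio.h>

/* Hypothetical tracing hook, stubbed for illustration: report each
 * data reference with its address, size and source location so the
 * simulator can map misses back to the code. */
static void cs_ref(const void *addr, size_t bytes, const char *file, int line)
{
    fprintf(stderr, "ref %p %lu %s:%d\n", addr, (unsigned long)bytes, file, line);
}

double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        cs_ref(&x[i], sizeof x[i], __FILE__, __LINE__);   /* trace load of x[i] */
        cs_ref(&y[i], sizeof y[i], __FILE__, __LINE__);   /* trace load of y[i] */
        s += x[i] * y[i];
    }
    return s;
}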
PerfAPI
• A standardized interface to hardware
performance counters
• Easily usable by application engineers as
well as tool developers
• Intended for
– Performance tools
– Evaluation
– Modeling
• Watch http://www.cs.utk.edu/~mucci/pdsa
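No API details are given on the slide; the sketch below shows only the intended style of use. Every identifier in it (perf_start, perf_stop, the event names) is hypothetical and stubbed so it compiles; it is not the real PerfAPI interface.

#include <stdio.h>

enum { PERF_FP_OPS, PERF_L2_MISSES };          /* hypothetical event IDs */

/* Hypothetical counter interface, stubbed for illustration only. */
static int perf_start(const int *events, int n) { (void)events; (void)n; return 0; }
static int perf_stop(long long *vals, int n)
{
    for (int i = 0; i < n; i++)
        vals[i] = 0;                           /* a real library returns counts */
    return 0;
}

static double a[100000];

int main(void)
{
    int events[2] = { PERF_FP_OPS, PERF_L2_MISSES };
    long long counts[2];

    perf_start(events, 2);                     /* start counting the region */
    for (int i = 0; i < 100000; i++)
        a[i] = a[i] * 2.0 + 1.0;               /* kernel under measurement */
    perf_stop(counts, 2);                      /* stop and read the counters */

    printf("fp ops = %lld, L2 misses = %lld\n", counts[0], counts[1]);
    return 0;
}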
High Performance Debugger
• Industry-wide lack of good debugging support for parallel programs
• TotalView is expensive and GUI-only
• Bandwidth is often not available off-site
• Based on dbx and gdb as backends
• Uses p2d2 from NASA as a framework
• Standardized, familiar command-line
interface
MPI Connect
• Connects separate MPI jobs using PVM
• Three function calls to enroll (sketched below)
• Uses include
– Metacomputing with vendor MPI
– Dynamic and fault-tolerant MPI jobs, available now
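The three calls are not named on the slide; the sketch below invents mpiconn_* identifiers purely to show the enrollment pattern (register the job, look up the peer, obtain an intercommunicator). None of these are MPI Connect's real function names.

#include <mpi.h>

/* Hypothetical enrollment pattern; the mpiconn_* names are invented
 * for this sketch and are not MPI Connect's actual interface. */
extern int mpiconn_register(const char *jobname, MPI_Comm local);
extern int mpiconn_lookup(const char *peer);
extern int mpiconn_intercomm(const char *peer, MPI_Comm *inter);

int main(int argc, char **argv)
{
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    mpiconn_register("ocean", MPI_COMM_WORLD);    /* enroll this MPI job */
    mpiconn_lookup("atmosphere");                 /* find the peer job */
    mpiconn_intercomm("atmosphere", &inter);      /* communicator to it */

    /* ...normal MPI sends and receives across 'inter'... */

    MPI_Finalize();
    return 0;
}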
The Future
• BYOC Workshops
• Regular Training Schedule
• Web-Based Training
• Consulting
• Cross-MSRC Information Exchange
• Technology Transfer
• Tool Development
Origin 2000 Performance
Prescription
• Always use dplace on all codes
• Always use
-LNO:cache_size2=4096
• For accuracy, compile and link with
-O2 -IPA -SWP:=ON -LNO -TENV:X=0-5
• or
-Ofast=ip27 -OPT:roundoff=0-3
-OPT:IEEE_arithmetic=1-3
Origin 2000 Performance
Prescription
• In Fortran, innermost array index should
change fastest
• Use functions in
-lcomplib.sgimath or -lscs
-lfastm
-lm
• Use the nonblocking MPI_Ixxxx primitives
• Always post IRECV early (see the sketch below)
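A minimal illustration of the last point (my example, not from the talk): posting MPI_Irecv before the matching send arrives lets the library deposit the message directly into the user's buffer instead of staging it internally.

#include <mpi.h>
#include <stdio.h>

/* Two ranks exchange buffers; each posts its receive early. */
int main(int argc, char **argv)
{
    int rank, peer, i;
    double sendbuf[1000], recvbuf[1000];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                        /* assumes exactly 2 ranks */

    for (i = 0; i < 1000; i++)
        sendbuf[i] = rank;

    /* Post the receive before sending so the incoming message can
     * land straight in recvbuf rather than an internal buffer. */
    MPI_Irecv(recvbuf, 1000, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(sendbuf, 1000, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d got message from rank %d\n", rank, peer);
    MPI_Finalize();
    return 0;
}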
Vampir Timeline Display
Vampir Global Activity Chart
Identifying a Message in Vampir
Nupshot Display