A Survey about Performance Counters, Libraries and Tools

Joseph Bryant Manzano Franco
Agenda
• Introduction
  • W3H: The Why, The What, The When, and The How
• Hardware Performance Libraries
  • Performance Application Programming Interface (PAPI)
  • Performance Counters Library (PCL)
• Visualization Tools
  • TAU: An example of a data collector
  • KOJAK: Semi automatic instrumentation tool
  • VAMPIR: An example of a script language
  • PE: The All levels approach
Introduction
Program Optimization
• Algorithm Optimization: search for the most effective algorithms and data structures
• Other ubiquitous optimizations: consider common architecture features such as cache structures
• Architecture Optimizations: apply architecture specific characteristics (PIM instructions, atomic loads and stores, massive memory allocations, etc.)
• Data Collection and Data Analysis: identify and solve unexpected problems in the interaction between hardware and software (memory and network bottlenecks, false sharing, poor cache management, etc.)
Introduction:
The Why
Data Collection
• High Level Library Functions
  • Easy to use and available in almost all libraries
  • Restricted and intrusive
  • Composed of timing functions and clever data manipulation
• Performance Counters
  • Easy to use (especially with high level wrappers)
  • Provide a range of measurements and are less intrusive
• Simulation environments
  • Complete control over the environment, including hardware, memory hierarchies and application code
  • Development is long for new architectures
  • Steep learning curve
Data Analysis
• Manual Analysis
  • Simple, but limited in its use
  • Prone to human error
• Automatic Statistical Analysis
  • Organizes the data in a suitable format
  • Still need to deal with numbers
• Visualization Tools
  • Graphical representation of data or its properties. Easy to identify trends even in large sets of data
Introduction:
The What
Performance Counters
Special registers that are present in a specific architecture
Designed to count architectural events
• An event is defined as an action that the hardware takes
• Predefined
• Examples: cache misses / hits, TLB misses / hits, context switches, cache invalidations, total instructions, etc.
Sun UltraSPARC: two 32 bit registers called PIC (Performance Instrumentation Counters). User control restricted
Pentium Pro: two 40 bit registers called PerfCtr0/1. User control available
Introduction:
The When
Date                   Machine/Author                                      Method of reading/Document
1966                   Don Widring                                         Initial Metering Design
~1970                  GE 645                                              Multics
1979                   Honeywell 6180                                      Yellow Submarine
1983                   Cray X-MP                                           User Accessible Registers
Late 80s / early 90s   IBM 3090 Mainframes, first generation IBM RS/6000   Restricted and Confidential
1992                   First Alpha Chip (DEC)                              Uprofile, kprofile or IPROBE
1993                   Pentium                                             Not documented and embedded in the MSR
Introduction:
The How
Example: Ultra SPARC Architecture
Two counters - 32 bits each
Events being counted: number of instructions (pic0) and cache invalidations (pic1)
[Figure: two CPUs, each with a cache ($), connected by a bus. The pic0 and pic1 values advance step by step as the instructions load 0,s1; load 1,s2; inc s2; add s1, s2, s1; store 0,s1 execute]
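The counting behavior in this example can be sketched as a toy model. This is only an illustration under stated assumptions: the `struct cpu` layout, the `exec_*` helpers, and the rule that a store invalidates the other CPU's cached copy are all hypothetical simplifications, not a description of the real UltraSPARC hardware or coherence protocol.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 2   /* toy memory with two cache lines, 0 and 1 */

/* Hypothetical per-CPU state: two 32 bit counters plus a tiny cache. */
struct cpu {
    uint32_t pic0;            /* counts instructions executed */
    uint32_t pic1;            /* counts cache invalidations received */
    bool cached[NUM_LINES];   /* true if this CPU holds the line */
};

/* A load counts as one instruction and brings the line into the cache. */
static void exec_load(struct cpu *self, int line)
{
    self->pic0++;
    self->cached[line] = true;
}

/* A register-only instruction (inc, add) counts as one instruction. */
static void exec_alu(struct cpu *self)
{
    self->pic0++;
}

/* A store counts as one instruction and invalidates the other CPU's
 * copy of the line, if it has one. */
static void exec_store(struct cpu *self, struct cpu *other, int line)
{
    self->pic0++;
    self->cached[line] = true;
    if (other->cached[line]) {
        other->cached[line] = false;
        other->pic1++;
    }
}

/* CPU b reads line 0 once; CPU a runs load, load, inc, add, store. */
static void run_example(struct cpu *a, struct cpu *b)
{
    exec_load(b, 0);
    exec_load(a, 0);
    exec_load(a, 1);
    exec_alu(a);          /* inc s2 */
    exec_alu(a);          /* add s1, s2, s1 */
    exec_store(a, b, 0);  /* invalidates b's copy of line 0 */
}
```

In this model the storing CPU ends with five instructions counted and no invalidations, while the other CPU records one instruction and one invalidation.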
Hardware Performance Libraries
• Performance Counters: a good idea, but only accessible to hardware experts.
• Solution: High Level Wrappers.
  • Usually written in C and Fortran.
  • Easy to make thread safe and to integrate into existing code.
• Examples:
  • Performance Application Programming Interface (PAPI)
  • Performance Counters Library (PCL)
Performance Application Programming Interface
• A set of high level wrapper functions that covers a vast set of architectures and events
  • Available for Power3, Power4, UltraSPARC II and III, all flavors of Pentium, Itanium, AMD Athlon, etc.
• Well documented, stable and reliable programming interface.
• Goals of the PAPI project:
  • To provide a solid foundation for cross platform performance analysis tools
  • To present a set of standard definitions for performance metrics on all platforms
  • To provide a standardized API among users, vendors, and academics
  • To be easy to use, well documented, and freely available
  (Excerpt obtained from the PAPI user guide)
• PAPI is an effort of the Innovative Computing Laboratory (ICL), which is part of the Department of Computer Science at the University of Tennessee
Block Diagram
[Figure: PAPI layers, from top to bottom]
Portable Layer: High Level API, Low Level API
Machine Dependent Layer: Substrate
Kernel Extensions
Operating System
Hardware Performance Counters
Overhead
Platform                           PAPI_read() – PAPI 3.0
Altix (Itanium 2 - Madison Chip)   1357 Cycles/Call
IBM Power 4                        4034 Cycles/Call
Itanium 2 (libpfm 2.0)             1606 Cycles/Call
Pentium 3 (perfctr 2.4.5)          324 Cycles/Call
Pentium 4 (perfctr 2.4.5)          401 Cycles/Call
SGI R12k                           3681 Cycles/Call
Ultrasparc II                      2150 Cycles/Call
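To put these per-call costs in perspective, the sketch below estimates what share of a measurement is spent in the reads themselves. It is a back-of-the-envelope calculation, not a PAPI facility: the function name `read_overhead_fraction` and the idea of bracketing a region with a fixed number of reads are assumptions made for illustration.

```c
/* Hypothetical helper: fraction of all measured cycles that is caused by
 * the counter reads themselves, for a code region of `region_cycles`
 * cycles that is bracketed by `reads` calls costing `cycles_per_read`. */
static double read_overhead_fraction(double cycles_per_read, int reads,
                                     double region_cycles)
{
    double overhead = cycles_per_read * reads;
    double total = overhead + region_cycles;

    /* Guard against an empty measurement. */
    return (total > 0.0) ? overhead / total : 0.0;
}
```

With the Pentium 3 figure of 324 cycles per call, two reads around a hypothetical 10,000 cycle region account for roughly 6% of the measured cycles. With the Power 4 figure of 4034 cycles per call, the same region would see roughly 45%, so short regions are dominated by the reads.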
PAPI:
Terminology
• Native Events:
  • Defined as countable by a specific CPU.
  • Machine dependent
  • Hexadecimal value and a mask provided by the PAPI libraries
• Preset Events:
  • Predefined events.
  • Events (or groups of events) that are considered useful and relatively ubiquitous across architectures.
  • A PAPI identifier is provided
• Event List:
  • An array of events (usually consisting of PAPI identifiers)
• High Level API:
  • A group of functions
  • A single list of events
  • Access to Native Events is prohibited.
  • Flexibility and performance are lost in exchange for ease of use
• Low Level API:
  • Another group of functions
  • Multiple event list definitions and a native events interface.
  • Only one event list can be running at any point in time
PAPI:
Steps
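#include <papi.h>
#include <stdio.h>

#define NUM_EVENTS 2

int main(int argc, char **argv)
{
    int Events[NUM_EVENTS] = { PAPI_TOT_INS, PAPI_TOT_CYC };
    long_long values[NUM_EVENTS], val2[NUM_EVENTS];
    int a = 0;
    int retval;

    /* Initialize the PAPI library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library initialization failed\n");
        return 1;
    }

    /* Start the counters */
    if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK) {
        fprintf(stderr, "Could not start the counters\n");
        return 1;
    }

    /* Operate on the counters: each read returns the counts and resets them */
    PAPI_read_counters(values, NUM_EVENTS);  /* discard the start-up counts */
    a++;
    PAPI_read_counters(values, NUM_EVENTS);  /* a++ plus the cost of one read */
    PAPI_read_counters(val2, NUM_EVENTS);    /* cost of one read only */

    printf("The value of a is: %i \n", a);
    printf("The Coarse Instructions are: %10lld\n", values[0]);
    printf("The Coarse Cycles are: %10lld\n", values[1]);
    printf("The Overhead Instructions are: %10lld\n", val2[0]);
    printf("The Overhead Cycles are: %10lld\n", val2[1]);
    printf("The Total Instructions are: %10lld\n", values[0] - val2[0]);
    printf("The Total Cycles are: %10lld\n", values[1] - val2[1]);

    /* Stop the counters and release their resources */
    PAPI_stop_counters(values, NUM_EVENTS);
    return 0;
}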
Steps shown in the code:
1. Initialization of the PAPI library
2. Start the counters
3. Operate on the counters
4. Stop the counters
5. De-allocate any resource that has been allocated
PAPI:
Output
The value of a is: 1
The Coarse Instructions are:          179
The Coarse Cycles are:                641
The Overhead Instructions are:        175
The Overhead Cycles are:              395
The Total Instructions are:             4
The Total Cycles are:                 246
Assembly Output of a++
ld  [%fp-52],%l0
add %l0,1,%l0
st  %l0,[%fp-52]
add %fp,-32,%o0
Note: the first access produces an (L2) cache miss
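The "Total" lines in the output are the coarse readings minus the cost of one extra read. A minimal sketch of that subtraction, using the numbers from the output above; the `struct counts` type and the `net_counts` helper are hypothetical names introduced only for this illustration.

```c
/* Hypothetical pair of readings for the two events used in the example. */
struct counts {
    long long instructions;
    long long cycles;
};

/* Net cost of the measured code: the coarse reading (code plus one read)
 * minus the overhead reading (one read on its own). */
static struct counts net_counts(struct counts coarse, struct counts overhead)
{
    struct counts net;

    net.instructions = coarse.instructions - overhead.instructions;
    net.cycles = coarse.cycles - overhead.cycles;
    return net;
}
```

Applied to the readings 179 / 641 and 175 / 395, this gives 4 instructions and 246 cycles. The 4 instructions match the four assembly instructions listed for a++, and the high cycle count for so few instructions is consistent with the cache miss noted on the first access.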
PAPI:
Extra Features
• Thread safe, with multithreading support
• Multiplexing where available
• Overflow control with thresholds
• Statistical profiling and related functions
• Error detection and control features
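Multiplexing lets more events be measured than there are physical counters by time-slicing the counters among them, so each event is only observed for part of the run. The sketch below shows the common linear time-scaling idea for estimating a full-run count. It is an assumption made for illustration: the function name `scale_multiplexed` is hypothetical, and PAPI's own estimator may work differently.

```c
/* Hypothetical helper: estimate the count an event would have reached
 * over the whole run, given that it was only counted during
 * `active_time` out of `total_time` (same unit for both). */
static long long scale_multiplexed(long long raw_count,
                                   long long active_time,
                                   long long total_time)
{
    /* An event that was never scheduled gives no basis for an estimate. */
    if (active_time <= 0)
        return 0;

    /* Linear extrapolation: assume the event rate seen while the event
     * was being counted also holds for the rest of the run. */
    return (long long)((double)raw_count * (double)total_time
                       / (double)active_time);
}
```

An event that registered 1000 occurrences while active for a quarter of the run would be estimated at 4000. The estimate is only as good as the assumption that the program behaves uniformly over time.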
Performance Counters Library (PCL)
• Another example of a high level performance counter library
• Events are classified (as in PAPI) as memory hierarchy events (caches, TLB, memory, etc.), instructions (instruction types, instructions completed, etc.), status of functional units, and rates and ratios.
• It supports the Pentium architectures up to Pentium 4, the AMD Athlon / Duron, the IBM Power series up to Power 3-II, Alpha’s 21164 and 21264, SGI’s R10000 and R12000, and the UltraSPARC family of processors
• PCL is available for C, C++ and Java
• PCL is an effort of Forschungszentrum Juelich GmbH and the University of Applied Sciences Bonn-Rhein-Sieg in Germany, and it is currently in its second version
PCL
• High Level API:
  • Similar to the PAPI High Level API, but the functions are different.
  • Event lists can be created in this API
  • Access to predefined events only
  • Recommended
• Low Level API:
  • Allows direct access to the performance counters
  • Not recommended
• Handle:
  • A single datum (usually an integer) that uniquely identifies a set of resources.
  • Used to provide a thread specific link to the resources (the list of events)
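The handle idea can be sketched as a small table in which an integer identifies one set of resources. This is a generic illustration, not PCL's implementation: the `handle_open`, `handle_lookup` and `handle_close` functions, the `struct event_set` layout, and the table sizes are all hypothetical. The table itself is not protected by a lock, so this sketch assumes handles are opened and closed from one thread at a time.

```c
#include <stddef.h>

#define MAX_HANDLES 8   /* hypothetical limit on open handles */
#define MAX_EVENTS  4   /* hypothetical limit on events per list */

/* Resources identified by one handle: here just a list of events. */
struct event_set {
    int in_use;
    int num_events;
    int events[MAX_EVENTS];
};

static struct event_set handle_table[MAX_HANDLES];

/* Reserve a slot for an event list; returns the handle or -1 on failure. */
static int handle_open(const int *events, int num_events)
{
    if (num_events < 0 || num_events > MAX_EVENTS)
        return -1;
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (!handle_table[h].in_use) {
            handle_table[h].in_use = 1;
            handle_table[h].num_events = num_events;
            for (int i = 0; i < num_events; i++)
                handle_table[h].events[i] = events[i];
            return h;
        }
    }
    return -1;  /* table is full */
}

/* Resolve a handle to its resources; returns NULL for an invalid handle. */
static const struct event_set *handle_lookup(int h)
{
    if (h < 0 || h >= MAX_HANDLES || !handle_table[h].in_use)
        return NULL;
    return &handle_table[h];
}

/* Release the resources identified by the handle. */
static void handle_close(int h)
{
    if (h >= 0 && h < MAX_HANDLES)
        handle_table[h].in_use = 0;
}
```

Each thread would open its own handle and pass it to every later call, so that no thread touches another thread's event list.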
PCL:
Steps
#include <pcl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int counter_list[2], a = 0;
    int ncounter;
    unsigned int mode;
    PCL_CNT_TYPE i_result_list[2];
    PCL_FP_CNT_TYPE fp_result_list[2];
    PCL_DESCR_TYPE descr;

    /* Initialize the PCL library and obtain a handle */
    PCLinit(&descr);

    /* Build the list of events to count, in user mode only */
    ncounter = 2;
    counter_list[0] = PCL_CYCLES;
    counter_list[1] = PCL_INSTR;
    mode = PCL_MODE_USER;

    /* Start the counters */
    PCLstart(descr, counter_list, ncounter, mode);
    a++;
    /* Stop the counters and collect the results */
    PCLstop(descr, i_result_list, fp_result_list, ncounter);

    printf("%f instructions in %f cycles\n",
           (double)i_result_list[1], (double)i_result_list[0]);

    /* Release the resources tied to the handle */
    PCLexit(descr);
    return 0;
}
Steps shown in the code:
1. Initialization of the PCL library
2. Start the counters
3. Operate on the counters
4. Stop the counters
5. De-allocate any resource that has been allocated
PCL:
Differences with PAPI
• Nested function calls are enabled
• Rates and ratios are function calls in the PAPI libraries
• The Low Level API deals with native code, as PAPI’s low level API does, but its use is not recommended in PCL
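Whichever library supplies them, rates and ratios are derived from raw counts. The sketch below shows the arithmetic behind two typical ones. The function names `event_ratio` and `rate_per_second` are hypothetical, and the clock frequency is an input the caller has to supply; neither function belongs to PAPI or PCL.

```c
/* Hypothetical helper: ratio of two event counts, for example
 * instructions per cycle or cache misses per memory access. */
static double event_ratio(double numerator, double denominator)
{
    return (denominator != 0.0) ? numerator / denominator : 0.0;
}

/* Hypothetical helper: events per second, derived from a cycle count
 * and the processor clock frequency in Hz.
 * Elapsed seconds = cycles / clock_hz, so rate = events * clock_hz / cycles. */
static double rate_per_second(double events, double cycles, double clock_hz)
{
    return (cycles != 0.0) ? events * clock_hz / cycles : 0.0;
}
```

For instance, 1000 events counted over 500 cycles on a 1 GHz clock correspond to a rate of 2e9 events per second.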
Visualization Tools
• After the information has been gathered by the tools, how can it be presented to the user in the most efficient manner?
• Visualization tools provide a good way to present trends in data across extensive data sets
• Examples of visualization tools:
  • Tuning and Analysis Utilities (TAU)
  • Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks (KOJAK)
  • VAMPIR / VAMPIRTrace
  • Performance Evaluator (PE)
Tuning and Analysis Utilities (TAU)
• Program and performance analysis tool framework for high-performance parallel and distributed computing.
• A suite of tools for static and dynamic analysis of programs written in C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java.
• Instrumentation by functions
• The concept of Inclusive and Exclusive
  • With Time
    • Exclusive time: the time spent in the function minus all the time spent in instrumented functions that are called by this function
    • Inclusive time: total time of the function
  • With Performance Counters
    • The same as with time, applied to the values of that performance counter
• Supported extensions in C and FORTRAN: MPI and OpenMP
• Hardware counter libraries supported: PAPI and PCL
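The inclusive and exclusive definitions above can be sketched with a small call tree. This is an illustration of the definition only, not TAU's data model: the `struct profile_node` type, the `exclusive_value` function and the fixed child limit are hypothetical. The same arithmetic applies whether the value is a time or a performance counter reading.

```c
#define MAX_CHILDREN 4   /* hypothetical limit on callees per function */

/* Hypothetical profile entry for one instrumented function. */
struct profile_node {
    const char *name;
    double inclusive;   /* total value measured for the function */
    int num_children;   /* number of instrumented callees */
    const struct profile_node *children[MAX_CHILDREN];
};

/* Exclusive value: the inclusive value of the function minus the
 * inclusive values of the instrumented functions it calls directly. */
static double exclusive_value(const struct profile_node *node)
{
    double value = node->inclusive;

    for (int i = 0; i < node->num_children; i++)
        value -= node->children[i]->inclusive;
    return value;
}
```

If a function has an inclusive time of 100 units and calls two instrumented functions with inclusive times of 30 and 50 units, its exclusive time is 20 units. Callees that are not instrumented stay hidden inside the caller's exclusive time.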
TAU Infrastructure
KOJAK
• Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks
• A complete infrastructure dedicated to finding performance bottlenecks and application properties
• Consists of the following components:
  • OpenMP Pragma And Region Instrumentor (OPARI), which redirects OpenMP function calls and directives toward wrappers that contain instrumentation information (POMP), and PMPI
  • TAU (function instrumentation)
  • Event Processing, Investigating and Logging (EPILOG) runtime library (event oriented trace creator utility)
  • Extensible Performance Tool (EXPERT), a trace file analyzer that searches the traces for low performing sections and classifies them according to severity. It uses the Event Analysis and Recognition Language (EARL)
  • CUBE (KOJAK’s trace visualization tool)
  • Trace transformations to different formats (for example, to the VAMPIR trace format)
KOJAK Infrastructure
KOJAK Snapshots
VAMPIR
• A configurable trace visualization tool
• Converts trace information into a variety of graphical views:
  • Process State Display
  • Statistics Display
  • Timeline Display
  • Communications Statistics
• Configured by using:
  • Pull-down menus
  • Configuration file
• The displays can be related to the source code
• Advanced zoom in and zoom out features
• Defined trace format: VAMPIR-Trace (runtime library enhanced with trace creation calls)
VAMPIR Infrastructure
[Figure: components shown are Source Code, Guide Compiler, Object Files, Linker, Guide Libraries, VAMPIRTrace Libraries, Executable, Config File, Trace File, and VAMPIR]
VAMPIR Snapshot
Performance Evaluator
• Java based tool
• All level analysis of a program's behavior:
  • Application software level analysis
    • Data / algorithm analysis
  • Operating system level analysis
    • Thread context switching
    • Thread scheduling
  • Hardware level analysis
    • Memory hierarchy
• Uses PMAPI performance counters (IBM proprietary)
Performance Evaluator Infrastructure
[Figure: components shown are K42 Infrastructure, AIX OS, Others, Parser / Modifier, PE Trace Format, PE2 Trace Format, and PE2 Visualization Tool. Numbered files: 1 = Trace Format File, 2 = Map File, 3 = Meta File]
Performance Evaluator:
A Run
• Get hardware information from the infrastructures (the source has been instrumented and the OS is also collecting information)
• Create:
  • Trace file(s): trace records of a program with shorthand versions of events
  • Map file: static information about functions, threads and other structures
  • Meta file(s): properties of a trace, record type definitions and map type definitions
• Feed the files to the tool
• Visualize the information with graphs
• View the whole application behavior from beginning to end
• Complete GUI with the Eclipse Workbench
• Designed to work with several multithreaded packages in C and Java
• OpenMP not supported
Questions? Comments?
Thanks so much for your time
