Visualizations and counter recommendations


Hardware Counters and Visualizations
Adam Leko
5/31/2005
HCS Research Laboratory
University of Florida
PAPI Hardware Counters


- In general, we should allow the user to collect any PAPI metric they would like
- However, if we want a list of counters to “support”, ones we can ensure work on all platforms and may use during analysis, I suggest the following (a collection sketch follows the list):
  - L1 I-cache misses
  - L2 D-cache misses
  - Total cycles
  - TLB misses
  - FLOPS
  - IOPS (derived from total cycles and total integer instructions issued)
  - Conditional branches mispredicted
  - Requests for access to a shared line
  - Hardware interrupts
- The metrics above would give a rough indication of overall performance
- But we can’t really predict what kind of information a user might want….
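
To make the list concrete, here is a minimal sketch (not from the slides) of collecting a few of the suggested presets with the standard PAPI C API; preset availability varies by platform, so real code should check every return code rather than exiting as bluntly as this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        /* Four of the suggested presets; swap in others
           (PAPI_TLB_TL, PAPI_HW_INT, ...) as needed. */
        int events[4] = { PAPI_L1_ICM,    /* L1 I-cache misses         */
                          PAPI_L2_DCM,    /* L2 D-cache misses         */
                          PAPI_TOT_CYC,   /* total cycles              */
                          PAPI_FP_OPS };  /* floating-point operations */
        long long values[4];
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        if (PAPI_create_eventset(&evset) != PAPI_OK ||
            PAPI_add_events(evset, events, 4) != PAPI_OK)
            exit(1);   /* a preset may be unavailable on this platform */

        PAPI_start(evset);
        /* ... region of interest ... */
        PAPI_stop(evset, values);

        printf("L1 ICM=%lld L2 DCM=%lld cycles=%lld FP ops=%lld\n",
               values[0], values[1], values[2], values[3]);
        return 0;
    }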
L1 I-Cache Misses, L2 D-Cache Misses


- L1 I-cache misses
  - In normal program execution, instructions have very high spatial and temporal locality
  - Therefore, a miss at L1 is likely to also incur a miss at L2 (and L3)
  - L1 I-cache misses provide a quick indicator of “branchy” code that could stall instruction issue
- L2 D-cache misses
  - The L1 D-cache is probably too small to fit the working set of most code
  - L2 D-caches are getting larger and larger in modern chips, so a miss at L2 is probably more significant for indicating code with poor data locality
  - For most applications, the L2 D-cache miss rate is probably more important than the L1 D-cache miss rate (miss-rate sketch below)

But we can’t be sure!
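
As a rough illustration of the L2 point above, a miss-rate sketch; it assumes the PAPI_L2_DCA preset (L2 data cache accesses) is available alongside PAPI_L2_DCM, which is not guaranteed everywhere, and the kernel() callback is a hypothetical stand-in for the measured region:

    #include <papi.h>

    /* Miss rate (misses / accesses) for the code run by kernel(). */
    double l2_dcache_miss_rate(void (*kernel)(void))
    {
        int events[2] = { PAPI_L2_DCM, PAPI_L2_DCA };
        long long v[2] = { 0, 0 };
        int evset = PAPI_NULL;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 2);

        PAPI_start(evset);
        kernel();            /* measured region */
        PAPI_stop(evset, v);

        return v[1] ? (double)v[0] / (double)v[1] : 0.0;
    }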
Total Cycles, IOPS, FLOPS, TLB Misses


- General indicators that should be easy to relate to
- Total cycles gives a top-level idea of what is taking the most CPU time
- In general, more IOPS and FLOPS vs. total cycles = better use of the hardware (rate calculation sketched below)
  - Should be supplemented with wall-clock time obtained from outside of PAPI
  - Given the same algorithm…
- TLB misses also give an estimate of OS page faults, and can indicate code that has poor locality at a higher level
  - Also, many TLB misses can indicate a shift in the working set of cache pages
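
A sketch of the rate calculations above, assuming the PAPI_FP_OPS and PAPI_INT_INS presets are available. The slide derives IOPS from total cycles and integer instructions issued; this sketch divides by wall-clock time instead (the same thing at a fixed clock rate), using PAPI_get_real_usec() for brevity where any external timer would do. kernel() is again a hypothetical stand-in:

    #include <stdio.h>
    #include <papi.h>

    void report_rates(void (*kernel)(void))
    {
        int events[3] = { PAPI_FP_OPS,    /* floating-point operations   */
                          PAPI_INT_INS,   /* integer instructions issued */
                          PAPI_TOT_CYC }; /* total cycles                */
        long long v[3];
        int evset = PAPI_NULL;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 3);

        long long t0 = PAPI_get_real_usec();
        PAPI_start(evset);
        kernel();
        PAPI_stop(evset, v);
        long long t1 = PAPI_get_real_usec();

        double secs = (t1 - t0) / 1e6;
        printf("MFLOP/s=%.1f  integer Mops/s=%.1f  cycles=%lld\n",
               v[0] / secs / 1e6, v[1] / secs / 1e6, v[2]);
    }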
Conditional Branch Mispredictions, Shared Line Accesses, Hardware Interrupts



- Branch mispredictions indicate code that is causing trouble for the branch predictor (misprediction-rate sketch below)
  - Consistently bad predictions = many stalls waiting for instructions
  - Probably doesn’t happen very often, but can have a significant impact on overall execution time
- Shared line accesses
  - Probably the most interesting SMP-related metric
  - Should indicate data contention at a low level
- Hardware interrupts
  - Consistent interrupts can really pollute the I- and D-caches
  - Can also indicate hardware that is not operating in a DMA mode (PIO generating many interrupts)
  - Can be used to approximate general OS overhead on “dedicated” systems with a non-lightweight OS (Linux clusters, Tru64, etc.)
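
A final sketch for the misprediction rate mentioned above, pairing PAPI_BR_MSP with the PAPI_BR_CN preset (conditional branches executed) and reading PAPI_HW_INT alongside; PAPI_CA_SHR is left out here only because its availability is especially platform-dependent. kernel() remains a hypothetical measured region:

    #include <stdio.h>
    #include <papi.h>

    void report_branches_and_interrupts(void (*kernel)(void))
    {
        int events[3] = { PAPI_BR_MSP,   /* mispredicted cond. branches   */
                          PAPI_BR_CN,    /* conditional branches executed */
                          PAPI_HW_INT }; /* hardware interrupts           */
        long long v[3];
        int evset = PAPI_NULL;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 3);

        PAPI_start(evset);
        kernel();
        PAPI_stop(evset, v);

        printf("mispredict rate=%.2f%%  interrupts=%lld\n",
               v[1] ? 100.0 * (double)v[0] / (double)v[1] : 0.0, v[2]);
    }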
Visualizations

- Graphical visualizations
  - Two modes of operation: “top-down” and “bottom-up”
    - Top-down = profile view, displayed alongside or under the source code
    - Bottom-up = trace view, with source code correlation obtained by right-clicking on events in the trace window
  - Also: “other” views
    - Communication/data access display matrix, color-coded via a gradient
    - Also presented in a way that shows array-specific communication for UPC
- Command-line visualizations
  - Print a dump of all trace events (like elg_print for KOJAK)
  - Give a bottom-up table of profiling information