Visualizations and counter recommendations
Hardware Counters and Visualizations
Adam Leko
5/31/2005
HCS Research Laboratory
University of Florida
PAPI Hardware Counters
In general, we should allow the user to collect any PAPI metric they would like
However, if we want to create a list of counters to “support” — counters we ensure work on all platforms & may be used during analysis — I suggest:
L1 I-cache misses
L2 D-cache misses
Total cycles
TLB misses
FLOPS
IOPS (derived from total cycles, total integer instructions issued)
Conditional branches mispredicted
Request for access to shared line
Hardware interrupts
The metrics above would give a rough indication of overall performance
But we can’t really predict what kind of information a user might want…
L1 I-Cache Misses, L2 D-Cache Misses
L1 I-cache misses
In normal program execution, instructions have very high spatial and temporal locality
Therefore, a miss at L1 is likely to also incur a miss at L2 (and
L3)
L1 I-cache misses provide a quick indicator for “branchy” code
that could stall instruction issue
L2 D-cache misses
L1 D-cache probably too small to fit working set of most code
L2 D-cache getting larger and larger in modern chips, so a miss
at L2 is probably more significant for indicating code with poor
data locality
For most applications, L2 D-cache miss rate probably more
important than L1 D-cache miss rate
But can’t be sure!!
Total Cycles, IOPS, FLOPS, TLB Misses
General indicators that should be easy to relate to
Total cycles gives a top-level idea of what is taking
the most CPU time
In general, more IOPS and FLOPS vs. total cycles =
better use of hardware
Should be supplemented with wall clock time obtained from
outside of PAPI
Given the same algorithm…
TLB misses also give an estimate of OS page faults and can indicate code that has poor locality at a higher level
Also, many TLB misses can indicate a shift in working set
of cache pages
Conditional Branch Mispredictions, Shared Line Accesses, Hardware Interrupts
Branch mispredictions indicate code that is causing trouble for
branch predictor
Consistent bad predictions = many stalls for instructions
Probably doesn’t happen very often, but can have a significant
impact on overall execution time
Shared line accesses
Probably most interesting SMP-related metric
Should indicate data contention at a low level
Hardware interrupts
Consistent interrupts can really pollute I- and D-cache
Can also indicate hardware that is not operating in a DMA mode
(PIO generating many interrupts)
Can be used to approximate general OS overhead on “dedicated” systems with a non-lightweight OS (Linux clusters, Tru64, etc.)
Visualizations
Graphical visualizations
Two modes of operation: “top-down” and “bottom-up”
Top-down = profile view, displayed alongside or under source
code
Bottom-up = trace view, source code correlation obtained by
right-clicking on events in trace window
Also: “other” views
Communication/data access display matrix, color-coded via a
gradient
Also presented in a way that shows array-specific communication
for UPC
Command-line visualizations
Print dump of all trace events (like elg_print for KOJAK)
Give bottom-up table of profiling information