Static Identification of Delinquent Loads V.M. Panait Sasturkar

Static Identification of
Delinquent Loads
V.M. Panait
A. Sasturkar
W.-F. Wong
Agenda
Introduction
Related Work
Delinquent Loads
Framework
Address Patterns, Decision Criteria
The heuristic: types of classes,
computing the weights, final classes
Results
Introduction
Cache – one of the major current
bottlenecks in performance
One approach: prefetch; but prefetch
what? Can’t prefetch everything…
Few loads are really “bad” – “delinquent
loads”
This paper: classification of address
patterns in the load instructions
Introduction
Done after code generation, but before
runtime
Singled out 10% of all loads causing
over 90% of the misses in 18 SPEC
benchmarks
Gets even better combined with basic
block profiling: 1.3% of loads covering
over 80% of the misses
Related Work
BDH method: classify loads based on
following criteria:
Region of memory accessed by the load: S
(stack), H (heap) or G (global).
Kind of reference: loading a scalar (S),
element of array (A) or field of a structure
(F)
Type of reference: (P)ointer or (N)ot.
Related Work
Some classes account for most misses:
GAN, HSN, HFN, HAN, HFP, HAP.
The OKN method: 3 simple heuristics
Use of a pointer dereference
Use of a strided reference
None of the above
This paper is much more precise than
both above methods
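As a rough illustration of the BDH labelling, the sketch below builds the three-letter class name from the three criteria. The function name `bdh_class` and the dictionary keys are hypothetical, chosen only for this example; F stands for a structure field, matching the class names HFN and HFP listed above.

```python
# Hypothetical encoding of the three BDH criteria, one letter each.
REGION = {"stack": "S", "heap": "H", "global": "G"}
KIND = {"scalar": "S", "array": "A", "field": "F"}

def bdh_class(region: str, kind: str, is_pointer: bool) -> str:
    """Return the three-letter BDH class of a load, e.g. 'HFP'."""
    return REGION[region] + KIND[kind] + ("P" if is_pointer else "N")

# Classes the slides list as accounting for most misses.
HIGH_MISS = {"GAN", "HSN", "HFN", "HAN", "HFP", "HAP"}

print(bdh_class("heap", "field", True))                  # HFP
print(bdh_class("stack", "scalar", False) in HIGH_MISS)  # False
```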
Delinquent Loads
Why not stores too? Write buffers are
apparently good enough
Why not do it in hardware? They do,
but:
Need additional specialized hardware
Complex decisions (fast) <-> complex
hardware
Memory profiling: not always practical
Delinquent Loads & Profiling
Framework
Assembly code -> address patterns for
each load instruction -> placement of
the load instruction in a class
Classes + weights -> heuristic function
If the value of the heuristic is greater
than a delinquency threshold, the
instruction is classified as possibly
delinquent
Address Patterns
Address Pattern = summary of how the
source address of the load instruction is
computed
Uses CFG and DF analysis (reaching
definitions) (one address pattern for
each control path reaching the load)
Only uses basic registers (BR): gp, sp,
regparam, regret
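A minimal sketch of how address patterns might be built, assuming expressions are encoded as nested tuples and reaching definitions are given as a plain dictionary from register name to its defining expressions. The encoding, the `expand` function and the `("rec", name)` marker are assumptions made for this example, and the substitution is far simpler than a real CFG-based analysis.

```python
from itertools import product

BASIC = {"gp", "sp", "regparam", "regret"}

def expand(expr, defs, active=()):
    """Return every address pattern for expr, one per combination of
    reaching definitions, with only basic registers left as leaves."""
    tag = expr[0]
    if tag == "const":
        return [expr]
    if tag == "reg":
        name = expr[1]
        if name in BASIC:
            return [expr]
        if name in active:          # register defined in terms of itself
            return [("rec", name)]
        out = []
        for d in defs[name]:        # one alternative per reaching definition
            out.extend(expand(d, defs, active + (name,)))
        return out
    if tag == "load":               # one level of dereferencing
        return [("load", p) for p in expand(expr[1], defs, active)]
    # binary operation: ("op", symbol, left, right)
    lefts = expand(expr[2], defs, active)
    rights = expand(expr[3], defs, active)
    return [("op", expr[1], l, r) for l, r in product(lefts, rights)]

# t1 has two reaching definitions, so the load at address t1 + 4
# gets two address patterns.
defs = {"t1": [("load", ("op", "+", ("reg", "sp"), ("const", 8))),
               ("op", "+", ("reg", "gp"), ("const", 16))]}
patterns = expand(("op", "+", ("reg", "t1"), ("const", 4)), defs)
print(len(patterns))  # 2
```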
The Decision Criteria
Classes are derived from these criteria
H1: Register usage in an address
pattern (usage of BR’s)
H2: Type of operations used in address
computation (arithmetic, logic)
H3: Maximum level of dereferencing
The Decision Criteria
H4: Recurrence (iterative walk through
memory)
H5: Execution frequency – based on BB
profiling; classifies loads as:
Rarely executed (used here as negative)
Seldom executed (also used as negative)
Fairly often executed (not used here)
In a program hotspot
Decision Criteria and Classes
Each criterion results in a set of classes
Class = set of address patterns with a
certain property
There are too many classes that can
result; only some are considered, and
some of those are also aggregated into
one class
Decision Criteria and Classes
H1 – based classes: enumerations of
the number of occurrences of each of
the 4 BR’s in an address pattern
H2 – based classes: address patterns
with multiplications and shift operations
H3 – based classes: as many as there
are levels of dereferencing in the
address patterns
Decision Criteria and Classes
H4 – based classes: two classes
(address pattern involves recurrence or
not)
H5 – based classes: three classes:
rarely, seldom and program hotspot
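The properties behind criteria H1 to H4 can be read off an address pattern with a single recursive walk. The sketch below assumes a hypothetical tuple encoding: `("reg", name)`, `("const", n)`, `("load", addr)` for a dereference, `("op", symbol, left, right)` for an operation, and `("rec", name)` for a recurrence. None of these names come from the paper.

```python
from collections import Counter

def features(p):
    """Return (BR usage counts, uses mul/shift, max deref depth, has recurrence)."""
    tag = p[0]
    if tag == "reg":
        return Counter([p[1]]), False, 0, False
    if tag == "const":
        return Counter(), False, 0, False
    if tag == "rec":
        return Counter(), False, 0, True
    if tag == "load":
        regs, mul_shift, depth, rec = features(p[1])
        return regs, mul_shift, depth + 1, rec
    _, sym, left, right = p
    lr, lm, ld, lrec = features(left)
    rr, rm, rd, rrec = features(right)
    return (lr + rr,                                  # H1: register usage
            lm or rm or sym in ("*", "<<", ">>"),     # H2: mul or shift
            max(ld, rd),                              # H3: dereferencing
            lrec or rrec)                             # H4: recurrence

# Address of the form *(sp + 8) + (regparam << 2)
pattern = ("op", "+",
           ("load", ("op", "+", ("reg", "sp"), ("const", 8))),
           ("op", "<<", ("reg", "regparam"), ("const", 2)))
regs, mul_shift, depth, rec = features(pattern)
print(regs["sp"], regs["regparam"], mul_shift, depth, rec)  # 1 1 True 1 False
```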
Experimental Setup
SimpleScalar toolkit: cache simulator
(for cache hits & misses), compiler,
objdump
Procedure: Fortran -> C code (via f2c)
-> MIPS executable (via C2MIPS
compiler) -> disassembled code (via
objdump)
Reconstruction of CFG and DF analysis
Experimental Setup
2 stages: learning/training and
experimental (actual)
Stage 1: get full memory profiling data
on a subset of SPEC benchmarks, use it
to compute weights for each class
Stage 2: use the heuristic thus obtained on a
new subset of benchmarks
The Heuristic: Types of Classes
Three types of classes:
Positive (loads in it are likely delinquent)
Negative (… not …)
Neutral
Positive classes have positive weights,
negative ones have negative weights,
neutral classes have a weight of zero
The Heuristic: Terminology
The miss probability of class F in
benchmark j:
$$m_j(F, C) = \frac{M(F, C)}{\sum_{i \in F} E(i)}$$
The amount of misses accounted for by
members of class F in benchmark j:
$$n_j(F, C) = \frac{M(F, C)}{M(P(I), C)}$$
The Heuristic: Terminology
mj(F,C) = likelihood of an instruction of
class F in benchmark j to be a cache miss
However, if that instruction is only
executed once, it won’t be a delinquent
load
nj(F,C) = proportion out of total number
of misses that members of F account for
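The two indices can be computed from per-load counts, as in the sketch below. It assumes that E(i) is the execution count of load i and that M(P(I), C) is the total number of misses over all loads; the function names and the sample numbers are made up for illustration.

```python
def miss_probability(cls, misses, execs):
    """m_j(F, C): misses of class F divided by executions of its loads."""
    total_exec = sum(execs[i] for i in cls)
    return sum(misses[i] for i in cls) / total_exec if total_exec else 0.0

def miss_share(cls, misses):
    """n_j(F, C): misses of class F divided by the misses of all loads."""
    total = sum(misses.values())
    return sum(misses[i] for i in cls) / total if total else 0.0

# Made-up counts for three loads in one benchmark.
misses = {"ld1": 90, "ld2": 5, "ld3": 5}
execs = {"ld1": 1000, "ld2": 1000, "ld3": 10}
F = {"ld1", "ld3"}
m = miss_probability(F, misses, execs)   # 95 / 1010
n = miss_share(F, misses)                # 95 / 100
```

Here ld3 misses on half of its executions but runs only 10 times, which is why the miss probability alone does not make a load delinquent.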
The Heuristic: Terminology
Strength index: r = mj / nj
A benchmark j is irrelevant to a class F if
both indices mj and nj are below certain
thresholds. Otherwise it is relevant.
Positive class: r > 5% for all benchmarks.
Negative class: nj < 0.5% for all benchmarks.
Neutral class: r < 5% for one or more benchmarks.
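One possible reading of these three rules as code is sketched below. The order of the checks (positive first, then negative, otherwise neutral) is an assumption, as is the function name `class_type`; the 5% and 0.5% thresholds are the ones stated above.

```python
def class_type(stats, r_min=0.05, n_max=0.005):
    """stats: list of (m_j, n_j) pairs, one per benchmark."""
    if not stats:
        return "neutral"
    ratios = [m / n for m, n in stats if n > 0]      # strength index r
    if len(ratios) == len(stats) and all(r > r_min for r in ratios):
        return "positive"
    if all(n < n_max for _, n in stats):
        return "negative"
    return "neutral"

print(class_type([(0.2, 0.5), (0.1, 0.4)]))        # positive
print(class_type([(0.001, 0.002), (0.0, 0.001)]))  # negative
print(class_type([(0.01, 0.5), (0.2, 0.4)]))       # neutral
```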
Computing the Weights
Form classes according to the five
decision criteria
Compute mj, nj for each class
Weight of class Fk
$$W(F_k) = \frac{1}{|R_{F_k}|} \sum_{j \in R_{F_k}} \frac{m_j(F_k, C)}{n_j(F_k, C)}$$
Computing the Weights
This is the formula for positive classes
only
Only relevant benchmarks are included
in the formula
|.| is the cardinality of that set, i.e. the
number of benchmarks relevant to that
class
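The weight of a positive class is the mean of mj / nj over its relevant benchmarks. The sketch below assumes hypothetical relevance thresholds `m_thr` and `n_thr`, since the slides do not give their values.

```python
def relevant(m, n, m_thr, n_thr):
    """A benchmark is irrelevant only if both indices are below threshold."""
    return not (m < m_thr and n < n_thr)

def positive_weight(per_benchmark, m_thr, n_thr):
    """per_benchmark: list of (m_j, n_j) pairs for one class."""
    kept = [(m, n) for m, n in per_benchmark
            if relevant(m, n, m_thr, n_thr) and n > 0]
    if not kept:
        return 0.0
    return sum(m / n for m, n in kept) / len(kept)

# The third benchmark falls below both (made-up) thresholds and is skipped.
stats = [(0.2, 0.5), (0.1, 0.4), (0.001, 0.0001)]
w = positive_weight(stats, m_thr=0.01, n_thr=0.01)   # about 0.325
```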
Aggregate Classes
AG1: both gp and sp are used 1+ each
(comes from H1)
AG2: only sp used 2+ (H1)
AG3: either * or shifts are used (H2)
AG4: one level dereferencing (H3)
AG5: two level dereferencing (H3)
AG6: three level dereferencing (H3)
Aggregate Classes
AG7: address patterns containing a
recurrence (H4)
AG8: seldom executed loads
(100 < f < 1000 executions) (H5)
AG9: rarely executed loads
(f < 100 executions) (H5)
Weight formula for negative classes:
negated mean of positive weights
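Membership in the nine aggregate classes and the negative-class weight can be sketched as follows. The function names and the argument layout (register counts, a mul/shift flag, the dereferencing depth, a recurrence flag and an execution count) are assumptions made for this example. A load executed exactly 100 times falls in neither AG8 nor AG9, following the bounds as stated.

```python
def aggregate_classes(regs, mul_or_shift, deref, recurrence, freq):
    """Return the set of aggregate classes an address pattern belongs to."""
    ags = set()
    if regs.get("gp", 0) >= 1 and regs.get("sp", 0) >= 1:
        ags.add("AG1")
    used = {r for r, c in regs.items() if c > 0}
    if used == {"sp"} and regs["sp"] >= 2:
        ags.add("AG2")
    if mul_or_shift:
        ags.add("AG3")
    if 1 <= deref <= 3:
        ags.add(f"AG{3 + deref}")      # AG4, AG5, AG6
    if recurrence:
        ags.add("AG7")
    if 100 < freq < 1000:
        ags.add("AG8")
    if freq < 100:
        ags.add("AG9")
    return ags

def negative_weight(positive_weights):
    """Negated mean of the weights of the positive classes."""
    return -sum(positive_weights) / len(positive_weights)

print(sorted(aggregate_classes({"sp": 2}, False, 1, True, 50)))
# ['AG2', 'AG4', 'AG7', 'AG9']
print(negative_weight([2.0, 4.0]))  # -3.0
```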
The Heuristic Function
$$\phi(i) = \max_{j} \sum_{k = AG1}^{AG9} W(k) \cdot d(j, k)$$
$$d(j, k) = \begin{cases} 1 & \text{if } j \in k \\ 0 & \text{otherwise} \end{cases}$$
$$\phi(i) > \delta \implies \text{the load is delinquent}$$
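A minimal sketch of the heuristic function, assuming each address pattern of a load has already been mapped to the set of aggregate classes it belongs to. The names `score` and `is_delinquent`, the weights and the threshold are illustrative values, not figures from the paper.

```python
def score(patterns, weights):
    """Highest sum of class weights over the address patterns of one load.

    patterns: one set of class names per address pattern (at least one).
    """
    return max(sum(weights.get(k, 0.0) for k in p) for p in patterns)

def is_delinquent(patterns, weights, threshold):
    """Flag the load when its score exceeds the delinquency threshold."""
    return score(patterns, weights) > threshold

# Made-up weights: two positive classes and one negative class.
weights = {"AG4": 1.5, "AG7": 2.0, "AG9": -3.0}
patterns = [{"AG4", "AG7"}, {"AG4", "AG9"}]
print(score(patterns, weights))               # 3.5
print(is_delinquent(patterns, weights, 3.0))  # True
```

Taking the maximum means one suspicious control path is enough to flag the load, even when its other address patterns score low.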
Precision and Coverage
Precision of a heuristic scheme H:
the percentage of all loads that scheme
H identifies as delinquent (the lower, i.e.,
the closer to the real one, the better)
Coverage of a heuristic scheme H:
the percentage of cache misses caused by
loads identified as delinquent by scheme H
(the closer to 100%, the better)
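Both measures are simple ratios. The sketch below assumes precision is taken relative to all loads in the program, which matches the "10% of all loads" figure quoted in the introduction; the function names and the sample data are made up.

```python
def precision(flagged, all_loads):
    """Share of all loads that the scheme flags as delinquent."""
    return len(flagged) / len(all_loads)

def coverage(flagged, misses):
    """Share of all cache misses caused by the flagged loads."""
    total = sum(misses.values())
    return sum(misses[i] for i in flagged) / total

# Ten loads; ld0 causes 91 of the 100 misses.
misses = {f"ld{i}": 1 for i in range(10)}
misses["ld0"] = 91
flagged = {"ld0"}
p = precision(flagged, misses.keys())   # 0.1
c = coverage(flagged, misses)           # 0.91
```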
Results on different inputs
Results when varying cache
associativity
Results when varying cache size
Performance on new
benchmarks
Performance summary
Performance of OKN & BDH
Performance with various 
Combination with BB profiling
Use the heuristic to sharpen the set
returned by BB profiling
Also add loads that are not in the
hotspots
A tunable parameter sets the percentage of the
highest scoring loads, detected by our method but
not by profiling, that we consider to be delinquent
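One way to read the combination is sketched below: keep the hotspot loads that the heuristic also flags, then add the top-scoring share of flagged loads that lie outside the hotspots. This is an interpretation of the bullets above, not the paper's exact procedure; `combine`, `pct` and all values are hypothetical.

```python
import math

def combine(hotspot_loads, scores, threshold, pct):
    """Merge basic-block profiling with the static heuristic.

    hotspot_loads: loads that profiling places in program hotspots.
    scores: heuristic score per load; pct: percentage of extra loads kept.
    """
    static = {i for i, s in scores.items() if s > threshold}
    sharpened = hotspot_loads & static            # prune the profiled set
    extra = sorted(static - hotspot_loads,
                   key=lambda i: scores[i], reverse=True)
    keep = math.ceil(len(extra) * pct / 100)      # top pct% outside hotspots
    return sharpened | set(extra[:keep])

hotspot = {"a", "b", "c"}
scores = {"a": 5, "b": 0.5, "c": 4, "d": 9, "e": 3, "f": 2.5, "g": 0.1}
result = combine(hotspot, scores, threshold=2, pct=50)
print(sorted(result))  # ['a', 'c', 'd', 'e']
```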
Combination with BB profiling
Conclusions
The static scheme for identifying
delinquent loads has a precision of 10%
and coverage of over 90% over 18
benchmarks
More precise than related work, similar
coverage
Immune to variation of framework
parameters (e.g. cache size, associativity, input)