Automatic Measurement of Instruction Cache Capacity in X-Ray
Kamen Yotov
[email protected]
IBM T. J. Watson Research Center
Joint work with:
Tyler Steele, Sandra Jackson,
Keshav Pingali, Paul Stodghill
Department of Computer Science
Cornell University
11/7/2015
QEST'05
1

Motivation: self-optimizing software
- Goal: portable performance
- Self-optimizing software
  - Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  - Determines experimentally optimal parameter values
  - Uses native C compiler to produce library
- Examples: ATLAS, FFTW, SPIRAL, …

Example: Register Blocking for MMM
- Hardware parameters
  - Number of FP registers (NR)
  - I-Cache Capacity (ICC)
- A simple model for the register tile size for MMM (Yotov et al. IEEE’05)
  - MU x NU + MU + NU + Temp ≤ NR
- KU (unroll of K loop)
  - does not depend on NR
  - depends on ICC
- Need to know NR and ICC!
[Figure: MMM tiling diagram; matrices A, B, C; labels MU, NU, K, NB]
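
The register-tile inequality can be explored with a short search. The sketch below is a hypothetical helper, not the procedure of the cited model: it enumerates (MU, NU) pairs that satisfy MU x NU + MU + NU + Temp ≤ NR and keeps the pair with the largest MU x NU product. The names `reg_tile` and `choose_reg_tile`, the "largest product" objective, and the tie-break toward square tiles are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical result type: a register tile of MU x NU elements. */
typedef struct { int mu, nu; } reg_tile;

/* Pick a register tile under the slide's constraint
 *     MU*NU + MU + NU + Temp <= NR.
 * Assumption: among feasible tiles, prefer the largest MU*NU and
 * break ties toward the smaller |MU - NU|. */
reg_tile choose_reg_tile(int nr, int temp)
{
    reg_tile best = {0, 0};
    int best_area = 0, best_skew = 0;
    int mu, nu;

    assert(nr >= 0 && temp >= 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            int used = mu * nu + mu + nu + temp;
            int area = mu * nu;
            int skew = mu > nu ? mu - nu : nu - mu;

            if (used > nr)
                break;              /* a larger nu only needs more registers */
            if (area > best_area ||
                (area == best_area && skew < best_skew)) {
                best.mu = mu;
                best.nu = nu;
                best_area = area;
                best_skew = skew;
            }
        }
    }
    return best;                    /* {0, 0} means no tile fits */
}
```

Under these assumptions, NR = 16 and Temp = 1 give MU = NU = 3, since 9 + 3 + 3 + 1 = 16. KU is not part of this search because, per the slide, it depends on ICC rather than NR.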

Why not consult the manuals?
- Self-optimizing systems
- Actual hardware values vs. number available for optimization
  - For software optimization, hardware values may not be relevant
  - e.g., the number of hardware registers may not equal the number of registers available for holding program values (register 0 on SPARC)
- Incomplete
  - Require online manuals
  - Parameters like capacity and line size of off-chip caches vary from model to model
  - Even the same model of computer may be shipped with different cache organizations
  - Not usually documented in processor manuals
- Moving Target

Automatic Measurement Tools
- lmbench
  - OS benchmark, some CPU / Memory benchmarks
  - Larry McVoy, BitMover, Inc.
  - Carl Staelin, HP
- Calibrator
  - Memory hierarchy benchmark
  - Stefan Manegold
  - Centrum voor Wiskunde en Informatica
- MOB
  - Memory hierarchy benchmark
  - Josep Blanquer, Robert Chalmers
  - University of California Santa Barbara

X-Ray
- Set of micro-benchmarks in ANSI C89
  - Download and compile on any architecture (portable)
  - Deduce hardware parameter values from timing results
- Some amount of O/S specific code
  - High-resolution timing routines
  - Super-page allocation
  - Currently supports Linux
  - Windows, Solaris, IRIX, and AIX in the works
- Paradox
  - Compiler optimizations may contaminate timing results
  - Cannot afford to turn off all optimizations

Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately

Step by Step (cont.)
    t = gettime();
    while (--R)    // R is number of repetitions
        r1 += r2;
    return gettime() - t;
Problem: loop overhead

Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (--i)    // loop unrolled U times
    {
        r1 += r2;
        r1 += r2;
        ........
        r1 += r2;
    }
    return gettime() - t;
Problem: compiler optimizations

Step by Step (cont.)
    t = gettime();
    i = R / U;
    switch (v)
    {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    .................
    case U: r1 += r2;
            if (--i)
                goto loop;
    }
    if (!v) return gettime() - t; else use(r1, r2);
Solution: “volatile int v = 0”

Latency of integer ADD: nano-benchmark C code
- Want to measure
  - r1+=r2
- Generate C Code from specification
  - <r1+=r2, <r1, r2: int>>

    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    .................
    case U: r1 += r2;
            if (--i)
                goto loop;
    }
    if (!v)
        return gettime() - t;
    else
    {
        vr = r1;
        vr = r2;
    }
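
The nano-benchmark fragment can be turned into a compilable function along the following lines. This is a minimal sketch, not X-Ray's generated code: the standard `clock()` stands in for `gettime()`, the unroll factor is fixed at 8, the function name `add_latency_ns` is hypothetical, and the final stores to `vr` always execute instead of sitting in the `else` branch.

```c
#include <assert.h>
#include <time.h>

#define UNROLL 8  /* illustrative unroll factor (U on the slide) */

/* Returns the average time of one `r1 += r2`, in nanoseconds.
 * Assumption: standard clock() stands in for the slide's gettime(). */
double add_latency_ns(long reps)
{
    volatile int v = 0;   /* always 0 at run time; the compiler cannot assume so */
    volatile int vr = 0;  /* opaque source and sink for r1 and r2 */
    register int r1 = vr;
    register int r2 = vr;
    long i = reps / UNROLL;
    clock_t t0, t1;

    assert(reps >= UNROLL);
    t0 = clock();
    switch (v) {          /* each case label is a possible entry point */
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i)
            goto loop;
    }
    t1 = clock();
    vr = r1;              /* keep r1 and r2 live so the adds are not dead code */
    vr = r2;
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9
           / (double)((reps / UNROLL) * UNROLL);
}
```

Because `clock()` is coarse, `reps` should be large enough that the whole loop runs for many clock ticks; a call such as `add_latency_ns(100000000L)` is a reasonable starting point. Dividing the result by the cycle time would give latency in cycles, which requires a separate frequency measurement.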

X-Ray architecture
[Figure: X-Ray architecture diagram; components: Micro-benchmark (Parameters, Control Engine), Nano-benchmark Specification, Nano-benchmark Generator, Nano-benchmark C Code, Compile / Execute / Time, Execution Time, Hardware Parameter Value]

Instruction Throughput
- Specification
- Control Engine
  - N=3, B=1:

Micro-benchmarks in X-Ray
- CPU
  - Frequency
  - Instruction Latency
  - Instruction Throughput
  - Instruction Existence
    - FPU on embedded processors
    - FMA on general purpose processors
  - SMP and SMT
- Memory Hierarchy
  - Number of Registers of various types (int, float, SSE, …)
  - Multilevel Caches, TLB
    - Associativity
    - Block Size
    - Capacity
    - Latency
  - Instruction Cache Capacity

Previous Approaches for Memory Hierarchy Parameters
- Saavedra Benchmark (Hennessy-Patterson)
  - Accesses elements of an array constant stride apart
  - Measures average memory access time
  - Deficiencies
    - Considers all levels simultaneously
    - Works only for capacities that are powers-of-2
    - Suffers from a number of implementation level deficiencies
      - Constant stride accesses
      - Loop overhead problems
      - Overlapping memory operations
      - Prone to compiler “optimizations”
[Figure: array accessed at constant stride S]
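
For reference, a constant-stride measurement of the kind described for the Saavedra benchmark can be sketched as follows. This is an illustration of the general technique only, not the original benchmark's code: the function name `stride_access_ns` and its parameters are hypothetical, and the standard `clock()` is assumed as the timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Average time per access, in nanoseconds, when reading every
 * stride-th byte of buf[0..size) for the given number of passes.
 * Assumption: standard clock() is the timer. */
double stride_access_ns(const volatile char *buf, size_t size,
                        size_t stride, long passes)
{
    size_t i, per_pass;
    long p;
    clock_t t0, t1;

    assert(buf != NULL && size > 0 && stride > 0 && passes > 0);
    per_pass = (size + stride - 1) / stride;   /* accesses in one pass */
    t0 = clock();
    for (p = 0; p < passes; p++)
        for (i = 0; i < size; i += stride)
            (void)buf[i];                      /* volatile read: not elided */
    t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9
           / ((double)per_pass * (double)passes);
}
```

The sketch deliberately keeps the properties the slide lists as deficiencies: the timed region includes loop overhead, the accesses are independent and so may overlap, and the result averages over every cache level the array touches.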

Example: Isolation of lower cache levels
- Idea for Ln measurements
  - Use sequences as for L1 measurements
  - Make L1…Ln-1 “transparent” to measurements
  - Unique in isolating the behavior of Ln so that all higher levels miss
- Approach
  - Use sequences of sequences
  - Convolution of sequences
[Figure: memory hierarchy CPU, L1, L2, L3; sequence S combined with sequence S]

Measuring I-Cache Capacity
- Approach for Data Cache does not work
  - Array of pointers → Code sequence with branches
  - Such branches are very predictable
  - Nearly impossible to get precise timing
- Measure time to execute special code sequence of size N statements
  - Find the biggest N for which there is no significant increase in time per statement
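
The "biggest N with no significant increase" criterion can be illustrated on a table of already-collected timings. The sketch below is one possible reading, not X-Ray's implementation: the function name `largest_flat_size` is hypothetical, and "significant" is assumed to mean exceeding the first sample by more than a relative tolerance `tol`, a parameter the slide does not define.

```c
#include <assert.h>
#include <stddef.h>

/* sizes[] holds code sizes N in ascending order; ns_per_stmt[] holds
 * the measured time per statement at each size.
 * Assumption: a "significant increase" is any sample more than
 * (1 + tol) times the first sample.
 * Returns the largest size before the first such sample, or the last
 * size if no sample qualifies. */
long largest_flat_size(const long *sizes, const double *ns_per_stmt,
                       size_t count, double tol)
{
    size_t i;

    assert(sizes != NULL && ns_per_stmt != NULL);
    assert(count > 0 && tol >= 0.0);
    for (i = 1; i < count; i++)
        if (ns_per_stmt[i] > ns_per_stmt[0] * (1.0 + tol))
            return sizes[i - 1];
    return sizes[count - 1];
}
```

With sizes 256, 512, 1024, 2048, 4096 and times 1.00, 1.01, 0.99, 1.02, 1.80 at a tolerance of 0.25, the function returns 2048. A fixed relative tolerance is fragile when timings are noisy or oscillate.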

Nano-benchmark
- Similar to Instruction Throughput
  - Parameters (1, 4)
  - Grow length N
- Code size computed
  - (char *)&&finish - (char *)&&start

Sensitivity
- Graph for Pentium M
  - 9 more in the paper
- Performance oscillates
  - Even after averaging out noise
  - Cannot wait for jump
  - Need more robust measurement

Control Engine Script
- Start with N=256
- Compute
  - Mean
  - Standard deviation
  - For
- Binary-search
  - Detect jump when time is more than
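
A search of this general shape can be sketched as below. Several parts are assumptions, because the slide text is cut off: the jump threshold is taken to be `k` standard deviations above the mean, the baseline statistics come from repeated runs at the starting size, and a doubling phase brackets the jump before the binary search. The names `measure_fn`, `step_model`, and `find_capacity` are hypothetical, and `step_model` is a synthetic stand-in for a real timing run.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical timing callback: time per statement for a code
 * sequence of n statements. */
typedef double (*measure_fn)(long n, void *ctx);

/* Synthetic stand-in for a real measurement: flat up to a capacity,
 * slower beyond it. For demonstration only. */
double step_model(long n, void *ctx)
{
    long capacity = *(const long *)ctx;
    return n <= capacity ? 1.0 : 5.0;
}

/* Assumed rule: a jump is a time more than k standard deviations
 * above the mean. Compared in squared form to avoid sqrt(). */
static int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

long find_capacity(measure_fn measure, void *ctx,
                   long n0, long n_max, int samples, double k)
{
    double sum = 0.0, sumsq = 0.0, mean, var;
    long lo = n0, hi;
    int s;

    assert(measure && n0 > 0 && n_max >= n0 && n_max <= LONG_MAX / 2);
    assert(samples > 0 && k >= 0.0);

    /* Baseline mean and variance from repeated runs at the start size. */
    for (s = 0; s < samples; s++) {
        double t = measure(n0, ctx);
        sum += t;
        sumsq += t * t;
    }
    mean = sum / samples;
    var = sumsq / samples - mean * mean;
    if (var < 0.0)
        var = 0.0;              /* guard against rounding */

    /* Double N until a jump appears or the limit is passed. */
    hi = 2 * lo;
    while (hi <= n_max && !is_jump(measure(hi, ctx), mean, var, k)) {
        lo = hi;
        hi *= 2;
    }
    if (hi > n_max)
        return lo;              /* no jump observed within n_max */

    /* Binary search: lo showed no jump, hi showed one. */
    while (hi - lo > 1) {
        long mid = lo + (hi - lo) / 2;
        if (is_jump(measure(mid, ctx), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

For the synthetic model with a capacity of 3000 statements, `find_capacity(step_model, &cap, 256, 1L << 20, 5, 3.0)` returns 3000. Binary search assumes the timings are monotone around the jump, which is why a statistical threshold is used here in place of a raw comparison.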
Experimental Results

Pentium 4
- Does not cache ISA instructions, but uops
- Trace cache
- Measure the number of instructions
- Smoothing in the nano-benchmark: minimum of time in

Conclusions
- X-Ray: A framework and tool
  - Algorithms for precise measurements of some important hardware parameters
  - Experimental results on many modern architectures
  - First to measure instruction cache capacity
- Other X-Ray resources
  - Memory Hierarchy parameter measurement appeared at SIGMETRICS’05
  - CPU parameter measurement appeared at QEST’05
- Improving X-Ray is work in progress…

Current and Future Work
- 2-address vs. 3-address code
- Out-of-Order execution
- Number of physical registers
- Number / Type of functional units
- Cache
  - bandwidth
  - write mode
  - sharedness
  - replacement policy

Thank you!
- My E-Mail
  - [email protected]
  - [email protected]
- Cornell Group homepage
  - http://iss.cs.cornell.edu
- This work emerged from a joint project with David Padua’s group at UIUC
  - http://polaris.cs.uiuc.edu/newframework.html
- Download X-Ray!
  - http://iss.cs.cornell.edu/software/x-ray.aspx