Profiling, Instrumentation, and Profile Based Optimization

Robert Cohn
Mark T. Vandevoorde
Introduction
Understanding the dynamic interaction
between programs and processors
– What do programs do?
– How do processors perform?
– How can we make it faster?
10/4/98
Profiling Tutorial
1-2
What to do?
Build tools!
– Profiling
– Instrumentation
– Profile based optimization
The Big Picture
[Diagram: Sampling, Instrumentation, Profiling, Analysis, Profile Based Optimization, Modeling]
Instrumentation
• User level view
• Executable editing
Code Instrumentation
Trojan Horse
[Diagram: TOOL]
• Application appears unchanged
• Data collected as a side effect of execution
Instrumentation Example
• Add extra code
Original:
if (b > c)
    t = 1;
else
    b = 3;
After instrumentation:
if (b > c) {
    bb[0]++;
    t = 1;
} else {
    bb[1]++;
    b = 3;
}
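A minimal sketch of how the counters in this example behave when the fragment actually runs. The wrapper functions `fragment` and `run_fragment`, and the choice of inputs, are assumptions made for illustration; only the `bb[0]++` / `bb[1]++` pattern comes from the slide.

```c
#include <assert.h>

/* Hypothetical counters: bb[0] counts the "then" block, bb[1] the "else" block. */
static long bb[2];

/* Instrumented version of the slide's fragment. b and t are passed by
   pointer only so that the fragment can live in a function. */
void fragment(int *b, int c, int *t) {
    if (*b > c) {
        bb[0]++;
        *t = 1;
    } else {
        bb[1]++;
        *b = 3;
    }
}

/* Run the fragment with b = 0 .. n-1 and c = 4, then return how many
   executions took the "then" path. The application's own results are
   unchanged; the counts are collected as a side effect. */
long run_fragment(int n) {
    int i;
    assert(n >= 0);
    bb[0] = bb[1] = 0;
    for (i = 0; i < n; i++) {
        int b = i, t = 0;
        fragment(&b, 4, &t);   /* "then" path is taken when i > 4 */
    }
    return bb[0];
}
```

After the run, `bb[]` is the basic block profile for the two arms of the branch.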
Instrumentation Uses
• Profiles
• Model new hardware
– What will this new branch predictor do?
– What is the miss rate of this new cache?
• Optimization opportunities
– find unnecessary loads and stores
– find divides by 1
What Tool Does Instrumentation?
• Compiler
– Compiler inserts extra operations
– Requires recompile, access to source code
• Executable editor
– Post-link tool inserts instrumentation code
– No rebuild, source code not required
– More difficult to relate back to source
Instrumentation Tools for Alpha
• All executable based
• General instrumentation:
– Atom on Digital Unix
• Distributed with Digital Unix
– Ntatom on Windows NT
• New! Download from web
• Specialized tools based on above
– hiprof, pixie, 3rd, ...
ATOM
• Tool for customized instrumentation
• User writes program that describes how to
instrument application
• Instrumentation program applied to
application, generates instrumented
application
• Instrumented application is run
• Data is collected
User Supplies
• Instrumentation routines: user written
program that inserts instrumentation
– calls to analysis routines
• Analysis routines: do the instrumentation
work at runtime (e.g. count a basic block)
Atom Programming Model
Iterate over objects: spice, libc.so, libm.so
Iterate over procedures: main(), Compute(), _exit()
Basic blocks: block1, block2, block3, block4, block5
Instructions:
ldq r1, 8(sp)
addq r1, 0x1, r2
stq r2, 8(sp)
bne r1, 0x1ffc40
ATOM Instrumentation API:
Navigation
• Objects (binary, shared library)
– GetFirstObj, GetNextObj
• Procedures
– GetFirstProc, GetNextProc
• Basic blocks
– GetFirstBlock, GetNextBlock
• Instructions
– GetFirstInst, GetNextInst
ATOM Instrumentation API:
Interrogation
• GetObjInfo, GetProcInfo, GetBlockInfo,
GetInstInfo
• IsBranchTarget
• GetInstRegUsage
• InstPC
• InstLineNo
• ...
ATOM Instrumentation API:
Definition
• AddCallProto
– tells atom the types of the arguments for calls to
analysis routines
ATOM Instrumentation API:
Instrumentation
• AddCallProgram, AddCallObj,
AddCallProc, AddCallBlock, AddCallInst,
ReplaceProcedure
• Insert before or after
Arguments to analysis routines
• Constants
– variables in instrumentation program, but
constant at instrumentation point
– e.g. uninstrumented PC, function name
• VALUE computed at runtime
– effective address, branch taken predicate
• Register
– r3, arguments, return value
Sample #1: Cache Simulator
Write a tool that computes the miss rate of the
application running in a 64KB, direct mapped
data cache with 32 byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%
Cache Tool Implementation
Application                  Instrumentation
main:
        clr  t0
loop:
        ldl  t2,0(a0)        Reference(0(a0))    [VALUE]
        addl t0,4,t0
        addl t2,0x10,t2
        stl  t2,0(a0)        Reference(0(a0));
        bne  t3,loop
        ret                  PrintResults();
Cache Analysis File
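#include <stdio.h>
#define CACHE_SIZE 65536   /* 64KB direct mapped cache */
#define BLOCK_SHIFT 5      /* 32 byte lines */
long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;
/* Called before every load and store with its effective address. */
void Reference(long address) {
  /* Parenthesize the mask: >> binds tighter than &. */
  int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
  long tag = address >> BLOCK_SHIFT;
  if (cache[index] != tag) { misses++; cache[index] = tag; }
  refs++;
}
/* Called at program exit: writes references, misses, and miss rate (%). */
void Print() {
  FILE *file = fopen("cache.out","w");
  fprintf(file,"%ld %ld %.2f\n",refs, misses, 100.0 * misses / refs);
  fclose(file);
}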
Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>
unsigned Instrument(int argc, char **argv, Obj *o) {
  Inst *i; Block *b; Proc *p;
  AddCallProto("Reference(VALUE)"); AddCallProto("Print()");
  AddCallProgram(ProgramAfter,"Print");
  /* Call Reference with the effective address before every load and store. */
  for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
    for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
      for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
        if (IsInstType(i, InstTypeLoad) || IsInstType(i,InstTypeStore))
          AddCallInst(i, InstBefore, "Reference", EffAddrValue);
  return 0;
}
Sample #2: Profiler
Write a tool that outputs the address of each basic
block and the number of times it is executed.
vssad-27> atom a.out prof.inst.c prof.anal.c
vssad-28> a.out.atom
Hello world
vssad-29> head prof.out
120001030 1
120001038 1
12000103c 1
120001058 33
120001064 1
Profiler Tool Implementation
Application                  Instrumentation
                             Init(3)
main:                        Count(0)    [Constant]
        clr  t0
loop:                        Count(1)
        ldl  t2,0(a0)
        addl t0,4,t0
        addl t2,0x10,t2
        stl  t2,0(a0)
        bne  t3,loop
                             Count(2)
        ret
                             PrintResults(addresses,3);
Profiler: prof.anal.c
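#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <string.h>   /* memset */
long *counts;         /* one execution counter per basic block */
/* Called once before the program starts. */
void Init(int nblocks) {
  counts = (long *)malloc(nblocks * sizeof(long));
  memset(counts, 0, nblocks * sizeof(long));
}
/* Called at the start of each basic block. */
void Count(int index) { counts[index]++; }
/* Called once at program exit: one "address count" line per block. */
void Print(long *blocks, int nblocks) {
  int i; FILE *file = fopen("prof.out","w");
  for (i = 0; i < nblocks; i++)
    fprintf(file,"%lx %ld\n",blocks[i],counts[i]);
  fclose(file);
}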
Profiler: prof.inst.c
#include <stdio.h>
#include <stdlib.h>   /* malloc */
#include <cmplrs/atom.inst.h>
void CallInitPrint(long *addresses, int nblocks);
void Instrument(int argc, char **argv, Obj *o) {
  Block *b; Proc *p; int index = 0;
  int nblocks = GetObjInfo(o, ObjNumberBlocks);
  long *addresses = (long *)malloc(nblocks * sizeof(long));
  CallInitPrint(addresses, nblocks);
  for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
    for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
      /* Record the block's address and count each entry to it. */
      addresses[index] = InstPC(GetFirstInst(b));
      AddCallInst(GetFirstInst(b), InstBefore, "Count", index++);
    }
}
Profiler: prof.inst.c
void CallInitPrint(long *addresses, int nblocks)
{
  char buffer[100];
  AddCallProto("Count(int)");
  AddCallProto("Init(int)");
  AddCallProgram(ProgramBefore,"Init",nblocks);
  /* The array length must be formatted into the prototype string;
     the element type matches the long addresses[] array. */
  sprintf(buffer,"Print(const stable long[%d],int)",nblocks);
  AddCallProto(buffer);
  AddCallProgram(ProgramAfter,"Print",addresses,nblocks);
}
Executable editors
• Input: executable, output: executable
• Instrument, optimize, translate
• Executable = image = binary = shared
library = shared object = dynamically linked
library (DLL)
• Executable editor, executable optimizer,
binary rewriter, binary translator, post link
optimizer
Executable Editing
• Insert/delete/reorder instructions and data
• Obstacles to modification
– Addresses are bound
– Registers are bound
Obstacles
if (a) a = b;
beq r1,+2
ldl r1,0x1000
lda a0,0x1000
bsr Reference
• Is a0 free?
• Adjust branch offsets
• Adjust literal addresses
Phases
1. Decompose
2. Build IR
3. Insert instrumentation
4. Convert IR to executable
1. Decompose Executable
Executable
Header
Text (code)
Data
Rdata
Exception Info
Relocations
Debug
[Diagram labels: Program code & data; Meta data]
Decompose
• Break executable into units
• unit: minimum data that must be kept
together
• code: unit is instruction
• data: unit is data section
– alternative: unit is data item
2. Build Internal Representation
Instruction list: add, load, beq
Data sections: Data, Sdata
MetaData: Exception Info, Relocations
Intermediate Representation
• Similar to compiler
– except unstructured, untyped data
– 1 to 1 mapping for IR and machine instructions
• Base representation should be compact
– fit in physical memory
• initial/final phases do multiple passes
• Representations built/thrown away for
procedures
Bound addresses
Data:
1
2
0x12345678
3
Code:
br +4
ldah r0,0x1234
lda r0,0x5678(r0)
Metadata:
Begin: 0x12345678
End: 0x12345680
Adjusting addresses
• No translation
• Dynamic translation
• Static translation
No translation
• Leave code and data at same address
Original:
beq r1,L2
ldl r1,0x1234
L2:
Instrumented:
beq r1,L2
br L1
L2:
...
L1:
lda a0,0x1234
bsr Reference
ldl r1,0x1234
br L2
Dynamic translation
• Address computation is unchanged
• Image has map of old->new address
• Code inserted to map old->new address at
runtime for load/store/branch
• Better:
– Do PC relative branches statically
– Keep data section at original address
– Still: indirect calls and jumps (not returns)
Static translation
• Address computation is altered for new
layout
• Find addresses
• Determine what they point to:
– unit, offset
• Insert instrumentation
• Adjust literals or offsets to compute new
address of unit
Other tools that change addresses
• Linker
– combine separately compiled objects
– adjust addresses based on assigned load address
– unit is section of object (data, text)
• Loader
– Load address != link address for DLL
– unit is entire image
• Use relocations
Relocations
Data:
1
2
0x12345678
3
Code:
br +4
ldl r1,10(gp)
ldah r0,0x1234
lda r0,0x5678(r0)
[Diagram annotations: No relocation required; May require relocation; Requires relocation]
Relocation example:
address: 0x200
type: ldah literal
object: 0x12345670
external:
How to recognize addresses?
• Metadata
– example: procedure begin, procedure end
– implicit in structure of data
• Absolute addresses
– example: literal address in data section
– use relocations
• Relative addresses: address offset
– example: pc relative branch, offset for base pointer
– may not need adjustment, usually no relocation
Relative Addresses
• Address computed as offset of another
address
• Address and Address + Offset point to same
unit: ok, unit moved as a unit
• Example:
a->field1: ldl r0,field1(a)
ar[4]: ldl r0,16(ar)
Relative Addresses
• Offset spans multiple units
• example:
PC relative branch:
br +4
Jump table (must be 1 unit):
ad = base + i
jmp ad
base:
br l1
br l2
br l3
br l4
Map address to unit and offset
Reference -> address
– in code: interpret instructions
br +4
ldah r0,0x1234
lda r0,0x5678(r0)
– in data: data is address
.data
0x12345678
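To recognize the address in the `ldah`/`lda` pair above, the editor has to interpret both instructions together. The sketch below shows the arithmetic, assuming the usual Alpha semantics (both 16-bit fields are sign-extended, and `ldah` shifts its field left by 16) and an address that fits in 32 bits. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address formed by "ldah r0,hi" followed by "lda r0,lo(r0)".
   Both 16-bit fields are sign-extended; ldah scales its field by 65536. */
int64_t pair_to_address(int16_t hi, int16_t lo) {
    return (int64_t)hi * 65536 + (int64_t)lo;
}

/* Split an address back into the two fields, as needed when the literal
   must be adjusted for a new layout. Because lo is sign-extended, hi is
   one larger than the upper half whenever bit 15 of the address is set. */
void address_to_pair(int64_t addr, int16_t *hi, int16_t *lo) {
    int64_t low = addr & 0xffff;
    if (low >= 0x8000)
        low -= 0x10000;                 /* sign-extend the low 16 bits */
    assert((addr - low) % 65536 == 0);  /* remainder is a multiple of 2^16 */
    *lo = (int16_t)low;
    *hi = (int16_t)((addr - low) / 65536);
}
```

For the slide's literal, `pair_to_address(0x1234, 0x5678)` gives `0x12345678`. An address such as `0x1234ABCD` needs `hi = 0x1235` because its low half is negative after sign extension.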
Map address to unit and offset
(relocation,address) -> (unit,offset)
– to code: pointer to instruction
– to data: data section and offset
• alternative: data item and offset
– offset = address - unit address
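A small sketch of the (unit, offset) mapping, assuming units are kept in a flat array and searched linearly. The `Unit` and `Target` records and both function names are hypothetical; a real editor would index units more efficiently.

```c
#include <assert.h>

/* A unit is the minimum chunk that moves as a whole (hypothetical record). */
typedef struct {
    unsigned long old_addr;  /* unit address in the original executable */
    unsigned long new_addr;  /* unit address after the new layout is assigned */
    unsigned long size;      /* size of the unit in bytes */
} Unit;

typedef struct {
    int unit;                /* index into the unit array, -1 if not found */
    unsigned long offset;    /* offset = address - unit address */
} Target;

/* Map an original address to the unit that contains it and the offset within it. */
Target find_target(const Unit *units, int n, unsigned long address) {
    Target t = { -1, 0 };
    int i;
    for (i = 0; i < n; i++)
        if (address >= units[i].old_addr &&
            address - units[i].old_addr < units[i].size) {
            t.unit = i;
            t.offset = address - units[i].old_addr;
            break;
        }
    return t;
}

/* After instrumentation, recompute the address from the unit's new location. */
unsigned long relocate(const Unit *units, Target t) {
    assert(t.unit >= 0);
    return units[t.unit].new_addr + t.offset;
}
```

Storing the reference as (unit, offset) instead of a raw address is what lets units move independently while every pointer into them stays valid.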
3. Insert Instrumentation
Instruction list: add, load, load, beq
Data sections: Data, Sdata, Ndata
MetaData: Exception Info, Relocations
Adding instrumentation code
• Instrumentation requires free registers
– wrapper routine saves and restores registers
beq r1,+2
save registers
lda a0,0x1000
bsr ra,wrapper
restore registers
ldl r1,0(r2)
wrapper:
Save registers on stack
bsr ra,Reference
Restore registers
return
Reference (analysis routine)
• Local/global/interprocedural analysis finds free registers
4. Convert IR to Executable
Executable
Header
Text
Data
Rdata
Ndata
Exception Info
Relocations
Debug
[Diagram labels: Program code and data; Meta data]
Profile Based Optimization
Profile based optimization
• Collect profile information
– example: how often basic blocks are executed
• Use profile to guide optimization
– example: inlining
Profile based Optimization
• Available on Alpha, MIPS, PA, PPC, Sparc,
x86
• Used in compilers and executable
optimizers
• Spec, products, too.
Speedup from code layout
[Bar chart: speedup from code layout for each benchmark, y-axis 0% to 50%]
Register allocation and inlining
[Bar chart: Speedup, y-axis 0.0% to 90.0%, two series: code layout; code layout + register + inlining. x-axis: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, SPECint]
User level view
Compiler:
• Compile
• Instrument
• Run scenario1
• Run scenario2
• Merge profiles
• Recompile
Executable optimizer:
• Instrument
• Run scenario1
• Run scenario2
• Merge profiles
• Optimize
Optimization’s sensitivity to
training data
• Experience with varying training
– compiler, spreadsheet, CAD, Spec95
• Some training sets are better than others
• Can find one or a combination that gives
best results in all scenarios
• Sometimes requires tuning of optimizations
Types of optimizations
• Enhance conventional optimization with
weights based on profile
• Transformations driven by profile info
• Examples
– Register allocation
– Code layout
– Inlining
Register allocation
while (a) {
    if (a > 3)
        b++;
    else
        c++;
    a--;
}
top:
    cmpgt a,3,t0
    brfalse t0,then
    addl b,1,b
    br join
then:
    ldl t0,c
    addl t0,1,t0
    stl t0,c
join:
    subl a,1,a
    brtrue a,top
• a, b, and c live for entire loop
• Should b or c get the last register?
• Information: block counts
Code layout: Reduce the number
of taken branches
• Greedy algorithm, lay out common paths sequentially
• Information:
– flow edge counts
[Diagram: flow graph with blocks 1 to 7 and edge counts 60, 40, 45, 55]
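One simplified way to read "lay out common paths sequentially" is sketched below: start at the entry block and keep appending the most frequently taken unplaced successor, so the hot path becomes fall-through code. This is an illustrative sketch, not the algorithm of any particular tool. The `MAXB` bound, the dense edge matrix, and the function name are assumptions.

```c
#include <assert.h>

#define MAXB 16   /* hypothetical upper bound on blocks for this sketch */

/* Greedy layout. edge[i][j] is the execution count of flow edge i -> j
   (0 means no edge). Starting at block 0, append the unplaced successor
   with the highest edge count; at a dead end, continue with the
   lowest-numbered unplaced block. Writes the order to order[] and
   returns the number of blocks placed. */
int greedy_layout(int n, long edge[MAXB][MAXB], int order[MAXB]) {
    int placed[MAXB] = {0};
    int count = 0, cur = 0, j;
    assert(n > 0 && n <= MAXB);
    while (count < n) {
        int best = -1;
        placed[cur] = 1;
        order[count++] = cur;
        /* Hottest outgoing edge to a block that is not yet placed. */
        for (j = 0; j < n; j++)
            if (!placed[j] && edge[cur][j] > 0 &&
                (best < 0 || edge[cur][j] > edge[cur][best]))
                best = j;
        /* Dead end: fall back to any unplaced block. */
        if (best < 0)
            for (j = 0; j < n && best < 0; j++)
                if (!placed[j])
                    best = j;
        if (best < 0)
            break;   /* every block has been placed */
        cur = best;
    }
    return count;
}
```

For a diamond where block 0 branches to block 1 (count 40) or block 2 (count 60) and both rejoin at block 3, the resulting order is 0, 2, 3, 1: the common path 0, 2, 3 falls through and only the less frequent path needs taken branches.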
Inlining
[Diagram: call graph of RtnA, RtnB, RtnC, RtnD with call edge counts 1000, 0, 2]
• Probably no advantage to inline RtnD into
RtnA
• RtnB is almost always called from RtnA
– thus no cache penalty for inlining
• Information: Call edge counts
Information to drive optimization
Basic:
– basic block counts
– flow edge counts
– call edge counts
More advanced:
– path profiles
– cache misses
– branch mispredicts
Computing basic block counts
• Instrumentation
– Use atom tool
– Use 64 bit integers
• Sampling
Computing call edges
[Diagram: call graph of rtna, rtnb, rtnc, rtnd]
PC relative call: Call edge count is same as basic block count
rtna:
move 1,a0
move 2,a1
bsr rtnb
Indirect call: keep hash table of targets and counts
rtna:
move 1,a0
move 2,a1
ldl r0,20(t0)
jsr r0
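The hash table for indirect calls could look roughly like the sketch below: an analysis routine receives the call site and the runtime target, and counts each (site, target) pair. The table size, hash function, and all names are assumptions; the sketch uses a fixed-size table with linear probing and does not grow.

```c
#include <assert.h>

#define TABLE_SIZE 1024            /* hypothetical; must be a power of two */

typedef struct {
    unsigned long site;            /* PC of the indirect call (jsr) */
    unsigned long target;          /* address the call jumped to */
    long count;                    /* 0 marks an empty slot */
} CallEdge;

static CallEdge table[TABLE_SIZE];

/* Find the slot for (site, target) with open addressing and linear probing.
   Returns an empty slot if the pair has not been seen, or 0 if the table is full. */
static CallEdge *find_slot(unsigned long site, unsigned long target) {
    unsigned long h = (site * 31 + target) & (TABLE_SIZE - 1);
    int probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        CallEdge *e = &table[(h + probes) & (TABLE_SIZE - 1)];
        if (e->count == 0 || (e->site == site && e->target == target))
            return e;
    }
    return 0;
}

/* Analysis routine sketch: called before each indirect call with the
   value of the target register. */
void CountIndirectCall(unsigned long site, unsigned long target) {
    CallEdge *e = find_slot(site, target);
    assert(e != 0);                /* this sketch does not grow the table */
    e->site = site;
    e->target = target;
    e->count++;
}

/* Look up the count recorded for one call edge. */
long CallEdgeCount(unsigned long site, unsigned long target) {
    CallEdge *e = find_slot(site, target);
    return e ? e->count : 0;
}
```

PC relative calls need none of this, since their single target makes the call edge count equal to the count of the calling basic block.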
Computing flow edge counts
from basic block counts
• Basic block count
= Σ incoming edges
= Σ outgoing edges
• Exceptions,
longjmp/setjmp are
implicit edges
• Tolerate inconsistencies
[Diagram: flow graph with counts 10, 10, 30, 20, 10, 10]
Computing flow edge counts
from basic block counts
• Some graphs have multiple solutions
• Guess!
• Instrument edges
• Instrument minimum number of blocks and edges
while (a) a--;
bzero a,skip
top: subl a,1,a
bnzero a,top
skip:
[Diagram: two flow graphs with different edge counts for the same basic block counts]
Two solutions for same bb count
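The ambiguity can be worked through for this loop. Flow conservation gives three independent equations for four edges, so one edge count has to be measured (or guessed) before the rest follow. In the sketch below the block names `entry`, `top`, and `skip`, the struct, and the function name are assumptions; the block counts 10, 20, and 10 are illustrative values.

```c
#include <assert.h>

/* Edge counts for "while (a) a--;" compiled as
       entry: bzero a,skip
       top:   subl a,1,a ; bnzero a,top
       skip:                                        */
typedef struct {
    long entry_to_skip, entry_to_top, top_to_top, top_to_skip;
} LoopEdges;

/* Block counts alone leave one degree of freedom. Given one measured edge
   (entry -> skip here), conservation of flow fixes the other three. */
LoopEdges solve_loop_edges(long entry, long top, long skip, long entry_to_skip) {
    LoopEdges e;
    e.entry_to_skip = entry_to_skip;
    e.entry_to_top  = entry - entry_to_skip;      /* edges out of entry sum to entry */
    e.top_to_skip   = skip - entry_to_skip;       /* edges into skip sum to skip */
    e.top_to_top    = top - e.entry_to_top;       /* edges into top sum to top */
    assert(e.top_to_top + e.top_to_skip == top);  /* edges out of top sum to top */
    return e;
}
```

With block counts entry = 10, top = 20, skip = 10, a measured entry-to-skip count of 1 yields edges (9, 11, 9), while a count of 9 yields (1, 19, 1). Both satisfy the same block counts, which is why instrumenting a single edge is enough to pick the right solution.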
Computing flow edge counts
from basic block counts
• Spanning tree algorithm
– given flow graph, costs, finds lowest cost set of
instrumentation points
– costs derived from static analysis or earlier runs
• Read Ball and Larus for details
Instrumenting flow edges
• ATOM: branch taken value can be passed to
analysis routine
• branch not taken: insert call to count after
conditional branch
• taken branch, indirect jump: insert new
basic block between branch and target
Merging multiple profiles
• Multiple runs generate multiple profiles,
how do we combine them?
– Add them together
– Should the profiles be weighted equally?
• User defined
• Scale so that sums are equal
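A sketch of the "scale so that sums are equal" option, assuming each profile is a flat array of counters indexed the same way. The function name and the use of `double` for the merged counts are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Merge two profiles of n counters each into out[]. Profile b is scaled so
   that its total matches the total of profile a, which gives a short run
   and a long run equal weight in the merged result. */
void merge_profiles(const long *a, const long *b, double *out, size_t n) {
    double sum_a = 0.0, sum_b = 0.0, scale;
    size_t i;
    assert(a != NULL && b != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        sum_a += a[i];
        sum_b += b[i];
    }
    /* An empty second profile contributes nothing. */
    scale = (sum_b > 0.0) ? sum_a / sum_b : 0.0;
    for (i = 0; i < n; i++)
        out[i] = a[i] + scale * b[i];
}
```

Simply adding the raw counts instead would let the longer run dominate. A user defined weighting would replace `scale` with a supplied factor.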
Using profiles
• Edge, block counts are in database
• For each procedure, compiler locates
counts in database and copies them to IR
• Every flow edge, call edge, block labeled
with execution count
• Optimizations that modify flow graph must
update profile information
IR/Profiled program mismatch
• Does the flow graph of the program you
profiled match the flow graph in the
compiler IR?
– Optimization
– Code generation
[Diagram: IR, Executable]
• Usually ok if you disable optimization
• Not a problem for executable optimizers
Persistence
• If the program is modified, can you use an
old profile?
• Generating a profile can be difficult and
time consuming
• Don’t hold up build process generating a
new profile every time
Usability
• Make it easy or no one will use it
• Limited changes to build process
• Limited opportunities for user to mess up
Profile based optimization
nirvana
• Profile any build
– tolerate IR/profiled program mismatches
• No instrumentation step
• Low cost profiling, < 5%
• No restructuring of makefile
• Big speedup!
[Callout: DCPI]
Tools for profile based
optimization
• Unix
– cc, f77
– om: executable optimizer called from cc
– cord: user specified procedure ordering
• NT
– scc: calls Visual C
– spike: executable optimizer
– link /order: user specified procedure ordering
• wstune generates ordering