Transcript Pin-pldi05

Pin
Building Customized Program Analysis Tools with
Dynamic Instrumentation
CK Luk, Robert Cohn, Robert Muth, Harish Patil,
Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood
Intel
Vijay Janapa Reddi
University of Colorado
http://rogue.colorado.edu/Pin
PLDI’05
Instrumentation
• Insert extra code into programs to collect
information about execution
– Program analysis:
• Code coverage, call-graph generation, memory-leak detection
– Architectural study:
• Processor simulation, fault injection
• Existing binary-level instrumentation systems:
– Static:
• ATOM, EEL, Etch, Morph
– Dynamic:
• Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO
=> Pin is a new dynamic binary instrumentation system
Advantages of Pin Instrumentation
1. Easy-to-use Instrumentation API
– Instrumentation code written in C/C++/asm
– ATOM-like API, based on procedure calls
2. Instrumentation tools portable across platforms
– Same tools work on IA32, EM64T (x86-64), Itanium, ARM
– Same tools work on Linux and Windows (ongoing work)
3. Low instrumentation overhead
– Pin automatically optimizes instrumentation code
– Pin can attach instrumentation to a running process
4. Robust
– Handles mixed code and data, variable-length instructions, dynamically-generated code
5. Transparent
– Application sees original addresses, values, and stack content
A Pintool for Tracing Memory Writes
#include <stdio.h>
#include "pin.H"

FILE* trace;

// Executed immediately before a write is executed
VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size)
{
    fprintf(trace, "%p: W %p %d\n", ip, addr, size);
}

// Executed when an instruction is dynamically compiled
VOID Instruction(INS ins, VOID* v) {
    if (INS_IsMemoryWrite(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                       IARG_INST_PTR,
                       IARG_MEMORYWRITE_EA,
                       IARG_MEMORYWRITE_SIZE,
                       IARG_END);
}

int main(int argc, char* argv[]) {
    PIN_Init(argc, argv);
    trace = fopen("atrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();
    return 0;
}
• Same source code works on the 4 architectures
=> Pin takes care of different addressing modes
• No need to manually save/restore application state
=> Pin does it for you automatically and efficiently
Dynamic Instrumentation
[Figure: original code (blocks 1 to 7) beside the code cache (translated blocks 1’, 2’, 7’); exits point back to Pin]
Pin fetches the trace starting at block 1 and starts instrumentation
Dynamic Instrumentation
[Figure: same original code and code cache as on the previous slide]
Pin transfers control into the code cache (block 1)
Dynamic Instrumentation
[Figure: the code cache now holds translated blocks 1’, 2’, 7’ and 3’, 5’, 6’; label: trace linking]
Pin fetches and instruments a new trace
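The three slides above describe a loop: fetch a trace, instrument it once, keep it in the code cache, and go back to the VM only on a cache miss. A toy C++ model of that loop may make the "compile once, execute many times" split concrete. It assumes a hypothetical program made of numbered blocks; `Block`, `ToyVM`, and `runToyLoop` are invented for illustration and are not Pin data structures or APIs.

```cpp
#include <cassert>
#include <map>

// Hypothetical block: its id and the id of its successor (-1 means exit).
struct Block {
    int id;
    int next;
};

// Toy stand-in for VM + JIT + code cache. Not Pin's real design.
struct ToyVM {
    std::map<int, Block> program;    // "original code", keyed by block id
    std::map<int, Block> codeCache;  // translated blocks (1', 2', ...)
    int compiles = 0;                // how often the "JIT" instrumented a block
    int executions = 0;              // how often instrumented code ran

    // Fetch a block from the original code, "instrument" it, and cache it.
    void compile(int id) {
        std::map<int, Block>::const_iterator it = program.find(id);
        assert(it != program.end());
        ++compiles;
        codeCache[id] = it->second;
    }

    // Run from `entry`, compiling only when the code cache misses.
    void run(int entry, int maxSteps) {
        int pc = entry;
        for (int step = 0; pc != -1 && step < maxSteps; ++step) {
            if (codeCache.find(pc) == codeCache.end()) compile(pc);
            const Block& b = codeCache.at(pc);
            ++executions;            // inserted analysis code would run here
            pc = b.next;
        }
    }
};

// Builds the loop 1 -> 2 -> 1 -> ... and runs it for `steps` block executions.
inline ToyVM runToyLoop(int steps) {
    ToyVM vm;
    vm.program[1] = Block{1, 2};
    vm.program[2] = Block{2, 1};
    vm.run(1, steps);
    return vm;
}
```

In this model a 100-step run of the two-block loop compiles each block once and executes cached code every time after that, which is why the cost of the inserted code, not of compilation, tends to dominate.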
Pin’s Software Architecture
[Figure: one address space containing the Pintool, Pin (Instrumentation APIs; Virtual Machine (VM) with JIT Compiler and Emulation Unit; Code Cache), and the Application, on top of the Operating System and Hardware]
• 3 programs (Pin, Pintool, App) in same address space:
– User-level only
• Instrumentation APIs:
– Through which Pintool communicates with Pin
• JIT compiler:
– Dynamically compiles and instruments
• Emulation unit:
– Handles insts that can’t be directly executed (e.g., syscalls)
• Code cache:
– Stores compiled code
=> Coordinated by VM
Pin Internal Details
• Loading of Pin, Pintool, & Application
• An Improved Trace Linking Technique
• Register Re-allocation
• Instrumentation Optimizations
• Multithreading Support
Register Re-allocation
• Instrumented code needs extra registers. E.g.:
– Virtual registers available to the tool
– A virtual stack pointer pointing to the instrumentation stack
– Many more …
• Approaches to get extra registers:
1. Ad-hoc (e.g., DynamoRIO, Strata, Dyninst)
– Whenever you need a register, spill one and fill it afterward
2. Re-allocate all registers during compilation
a. Local allocation (e.g., Valgrind)
– Allocate registers independently within each trace
b. Global allocation (Pin)
– Allocate registers across traces (can be inter-procedural)
Valgrind’s Register Re-allocation
Original Code:
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
t:
add 1, %eax
sub 2, %ebx
Trace 1 (after re-allocation):
mov 1, %eax
mov 2, %esi
cmp %ecx, %edx
mov %eax, SPILLeax
mov %esi, SPILLebx
jz t’
Binding in Trace 1 (Virtual -> Physical):
%eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx
Trace 2:
t’:
mov SPILLeax, %eax
mov SPILLebx, %edi
add 1, %eax
sub 2, %edi
Binding in Trace 2 (Virtual -> Physical):
%eax -> %eax, %ebx -> %edi, %ecx -> %ecx, %edx -> %edx
=> Simple but inefficient
• All modified registers are spilled at a trace’s end
• Refill registers at a trace’s beginning
Pin’s Register Re-allocation
Scenario (1): Compiling a new trace at a trace exit
Original Code:
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
t:
add 1, %eax
sub 2, %ebx
Trace 1 (after re-allocation):
mov 1, %eax
mov 2, %esi
cmp %ecx, %edx
jz t’
Compile Trace 2 using the binding at Trace 1’s exit (Virtual -> Physical):
%eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx
Trace 2:
t’:
add 1, %eax
sub 2, %esi
=> No spilling/filling needed across traces
Pin’s Register Re-allocation
Scenario (2): Targeting an already generated trace at a trace exit
Original Code:
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
t:
add 1, %eax
sub 2, %ebx
Trace 1 (being compiled, after re-allocation):
mov 1, %eax
mov 2, %esi
cmp %ecx, %edx
mov %esi, SPILLebx
mov SPILLebx, %edi
jz t’
Binding in Trace 1 (Virtual -> Physical):
%eax -> %eax, %ebx -> %esi, %ecx -> %ecx, %edx -> %edx
Trace 2 (in code cache):
t’:
add 1, %eax
sub 2, %edi
Binding in Trace 2 (Virtual -> Physical):
%eax -> %eax, %ebx -> %edi, %ecx -> %ecx, %edx -> %edx
=> Minimal spilling/filling code
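Scenario (2) amounts to comparing the binding at the exit of the trace being compiled with the binding the target trace was compiled under, and emitting moves only for virtual registers that differ. A minimal C++ sketch of that comparison follows. It assumes a binding is a plain map from virtual to physical register names and that compensation always goes through the spill slot, as in the slide's example. `Binding` and `reconcile` are hypothetical names, not Pin internals.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Virtual register -> physical register, e.g. {"%ebx" -> "%esi"}.
typedef std::map<std::string, std::string> Binding;

// Returns the compensation code for a trace exit whose binding is `exitB`
// when the target trace expects `entryB`. Assumes both bindings cover the
// same virtual registers. All spills are emitted before all fills so that a
// fill never overwrites a physical register that still has to be spilled.
inline std::vector<std::string> reconcile(const Binding& exitB,
                                          const Binding& entryB) {
    std::vector<std::string> spills;
    std::vector<std::string> fills;
    for (Binding::const_iterator it = exitB.begin(); it != exitB.end(); ++it) {
        const std::string& vreg = it->first;
        const std::string& physOut = it->second;
        Binding::const_iterator in = entryB.find(vreg);
        assert(in != entryB.end());
        if (physOut == in->second) continue;            // same binding: no code
        std::string slot = "SPILL" + vreg.substr(1);    // "%ebx" -> "SPILLebx"
        spills.push_back("mov " + physOut + ", " + slot);
        fills.push_back("mov " + slot + ", " + in->second);
    }
    spills.insert(spills.end(), fills.begin(), fills.end());
    return spills;
}
```

With the two bindings from the slide (%ebx in %esi at the exit, %ebx in %edi in the cached trace), this yields exactly the pair `mov %esi, SPILLebx` and `mov SPILLebx, %edi`. With identical bindings, as in Scenario (1), it yields no code.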
Instrumentation Optimizations
1. Inline instrumentation code into the application
2. Avoid saving/restoring eflags with liveness analysis
3. Schedule inlined instrumentation code
Example: Instruction Counting
Instrument without applying any optimization
Instrumentation call:
BBL_InsertCall(bbl, IPOINT_BEFORE, AFUNPTR(docount),
IARG_UINT32, BBL_NumIns(bbl),
IARG_END)
Original code:
cmov %esi, %edi
cmp %edi, (%esp)
jle <target1>
add %ecx, %edx
cmp %edx, 0
je <target2>
Trace:
mov %esp,SPILLappsp
mov SPILLpinsp,%esp
call <bridge>
cmov %esi, %edi
mov SPILLappsp,%esp
cmp %edi, (%esp)
jle <target1’>
mov %esp,SPILLappsp
mov SPILLpinsp,%esp
call <bridge>
add %ecx, %edx
cmp %edx, 0
je <target2’>
bridge():
pushf
push %edx
push %ecx
push %eax
movl 0x3, %eax
call docount
pop %eax
pop %ecx
pop %edx
popf
ret
docount():
add %eax,icount
ret
=> 33 extra instructions executed altogether
Example: Instruction Counting
Inlining
Original code:
cmov %esi, %edi
cmp %edi, (%esp)
jle <target1>
add %ecx, %edx
cmp %edx, 0
je <target2>
Trace:
mov %esp,SPILLappsp
mov SPILLpinsp,%esp
pushf
add 0x3, icount
popf
cmov %esi, %edi
mov SPILLappsp,%esp
cmp %edi, (%esp)
jle <target1’>
mov %esp,SPILLappsp
mov SPILLpinsp,%esp
pushf
add 0x3, icount
popf
add %ecx, %edx
cmp %edx, 0
je <target2’>
=> 11 extra instructions executed
Example: Instruction Counting
Inlining + eflags liveness analysis
Original code:
cmov %esi, %edi
cmp %edi, (%esp)
jle <target1>
add %ecx, %edx
cmp %edx, 0
je <target2>
Trace:
mov %esp,SPILLappsp
mov SPILLpinsp,%esp
pushf
add 0x3, icount
popf
cmov %esi, %edi
mov SPILLappsp,%esp
cmp %edi, (%esp)
jle <target1’>
add 0x3, icount
add %ecx, %edx
cmp %edx, 0
je <target2’>
=> 7 extra instructions executed
Example: Instruction Counting
Inlining + eflags liveness analysis + scheduling
Original code:
cmov %esi, %edi
cmp %edi, (%esp)
jle <target1>
add %ecx, %edx
cmp %edx, 0
je <target2>
Trace:
cmov %esi, %edi
add 0x3, icount
cmp %edi, (%esp)
jle <target1’>
add 0x3, icount
add %ecx, %edx
cmp %edx, 0
je <target2’>
=> 2 extra instructions executed
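The last two steps of this example depend on one question: at a given point in a block, will eflags be read before they are overwritten? The inlined `add 0x3, icount` clobbers eflags, so it needs `pushf`/`popf` only where the flags are live, and scheduling can move it to a point where they are dead. The C++ sketch below models that reasoning on the two blocks from the slides. The `Inst` summary type and both helper functions are hypothetical, and the analysis is deliberately block-local and conservative; the slides do not describe Pin's actual algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-instruction summary: does it read or write eflags?
struct Inst {
    const char* text;
    bool readsFlags;
    bool writesFlags;
};

// eflags are live at `pos` if an instruction at or after `pos` reads them
// before any instruction overwrites them. If the block ends without a write,
// conservatively report the flags as live.
inline bool flagsLiveAt(const std::vector<Inst>& block, std::size_t pos) {
    assert(pos <= block.size());
    for (std::size_t i = pos; i < block.size(); ++i) {
        if (block[i].readsFlags) return true;   // read comes first: live
        if (block[i].writesFlags) return false; // overwritten first: dead
    }
    return true;
}

// Scheduling: earliest position in the block where flag-clobbering
// instrumentation can be inserted without saving eflags.
// Returns block.size() if there is no such position.
inline std::size_t firstDeadFlagsSlot(const std::vector<Inst>& block) {
    for (std::size_t pos = 0; pos < block.size(); ++pos)
        if (!flagsLiveAt(block, pos)) return pos;
    return block.size();
}

// First basic block of the slide example.
inline std::vector<Inst> firstBlock() {
    std::vector<Inst> b;
    b.push_back(Inst{"cmov %esi, %edi", true, false});
    b.push_back(Inst{"cmp %edi, (%esp)", false, true});
    b.push_back(Inst{"jle <target1>", true, false});
    return b;
}

// Second basic block of the slide example.
inline std::vector<Inst> secondBlock() {
    std::vector<Inst> b;
    b.push_back(Inst{"add %ecx, %edx", false, true});
    b.push_back(Inst{"cmp %edx, 0", false, true});
    b.push_back(Inst{"je <target2>", true, false});
    return b;
}
```

Under this model the flags are live at the top of the first block (the `cmov` reads them) and dead at the top of the second (the `add` overwrites them), matching the liveness slide. The first dead slot in the first block is just after the `cmov`, which is where the scheduling slide places the counter update.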
Pin Instrumentation Performance
Runtime overhead of basic-block counting with Pin on IA32
(SPEC2K using reference data sets)
Average Slowdown                                   SPECINT   SPECFP
Without optimization                               10.4      3.9
Inlining                                           7.8       3.5
Inlining + eflags liveness analysis                2.8       1.5
Inlining + eflags liveness analysis + scheduling   2.5       1.4
Comparison among Dynamic Instrumentation Tools
Runtime overhead of basic-block counting with three different tools
Average Slowdown (SPECINT)
Valgrind    8.3
DynamoRIO   5.1
Pin         2.5
• Valgrind is a popular instrumentation tool on Linux
– Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in binary dynamic optimization
– Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation
Pin Applications
• Sample tools in the Pin distribution:
– Cache simulators, branch predictors, address tracer, syscall tracer,
edge profiler, stride profiler
• Some tools developed and used inside Intel:
– Opcodemix (analyze code generated by compilers)
– PinPoints (find representative regions in programs to simulate)
– A tool for detecting memory bugs
• Some companies are writing their own Pintools:
– A major database vendor, a major search engine provider
• Some universities using Pin in teaching and research:
– U. of Colorado, MIT, Harvard, Princeton, U of Minnesota,
Northeastern, Tufts, University of Rochester, …
Conclusions
• Pin
– A dynamic instrumentation system for building your own
program analysis tools
– Easy to use, robust, transparent, efficient
– Tool source compatible on IA32, EM64T, Itanium, ARM
– Works on large applications
• database, search engine, web browsers, …
– Available on Linux; Windows version coming soon
• Downloadable from http://rogue.colorado.edu/Pin
– User manual, many example tools, tutorials
– 3300 downloads since 2004 July
Acknowledgments
• Prof Dan Connors
– Hosting Pin website at U of Colorado
• Intel Bistro Team
– Providing the Falcon decoder/encoder
– Suggesting instrumentation scheduling
• Mark Charney
– Providing the XED decoder/encoder
• Ramesh Peri
– Implementing part of Itanium Instrumentation
Backup
Talk Outline
• A Sample Pintool
• Pin Internal Details
• Experimental Results
• Pin Applications
• Conclusions
Trace Linking
• Trace linking is a very effective optimization
– Bypass VM when transferring from one trace to another
– Slowdown without trace linking as much as 100x
• Linking direct branches/calls
– Straightforward as targets are unique
• Linking indirect branches/calls & returns
– More challenging because the target can be different
each time
– Our approach:
• For all indirect control transfers, use chaining
• For returns, further optimize with function cloning
Indirect Trace Linking
Original indirect jump:
jmp [%eax]
Translated jump:
mov [%eax], T
jmp target_1’
Chain of predicted targets:
target_1’:
if (T != target_1)
jmp target_2’
…
target_N’:
if (T != target_N)
jmp LookupHtab
…
Slow path:
LookupHtab:
if (hit)
jmp translated[T]
else
call Pin
• Chains are built incrementally
– Most recent target inserted at the chain’s head
• Hash table is local to each indirect jump
=> Improved prediction accuracy over existing schemes
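The chain-then-hash-table lookup on this slide can be modelled in a few lines of C++. The sketch below makes assumptions the slide does not state: chain length is capped (`maxChain`), a target that falls off the chain stays in the jump's local table, and "call Pin" is replaced by a stub that pretends the translation lives at `T + 0x1000`. `IndirectJump` and its members are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

typedef std::uint32_t Addr;

// Toy model of one translated indirect jump with its own chain and table.
struct IndirectJump {
    std::list<Addr> chain;                       // predicted targets, head first
    std::unordered_map<Addr, Addr> translated;   // local table: original -> translated
    std::size_t maxChain = 2;                    // assumed cap on chain length
    int chainHits = 0;
    int htabHits = 0;
    int vmCalls = 0;

    // Stand-in for "call Pin": pretend the translation is at target + 0x1000.
    Addr translateInVM(Addr target) {
        assert(translated.count(target) == 0);
        ++vmCalls;
        return target + 0x1000u;
    }

    // Resolve the dynamic target T: chain first, then the local hash table,
    // then the slow path through the VM.
    Addr resolve(Addr T) {
        for (std::list<Addr>::const_iterator it = chain.begin();
             it != chain.end(); ++it) {
            if (*it == T) { ++chainHits; return translated.at(T); }
        }
        std::unordered_map<Addr, Addr>::const_iterator hit = translated.find(T);
        if (hit != translated.end()) { ++htabHits; return hit->second; }
        Addr code = translateInVM(T);
        translated[T] = code;
        chain.push_front(T);                     // most recent target at the head
        if (chain.size() > maxChain) chain.pop_back();
        return code;
    }
};
```

Resolving targets 10, 20, 30 and then 20 and 10 again takes three trips to the VM, one chain hit (20), and one hash-table hit (10, which the cap pushed off the chain). Keeping the table local to the jump means unrelated indirect jumps do not compete for the same entries.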
Return-Address Prediction
• Distinguish different callers to a function by cloning:
Original code:
A():
call F()
B():
call F()
F():
ret
Without cloning (one translated copy of F):
F’():
pop T
jmp A’
A’:
if (T != A)
jmp B’
…
B’:
if (T != B)
jmp Lookuphtab1
…
With cloning (one translated copy of F per caller):
F_A’():
pop T
jmp A’
A’:
if (T != A)
jmp Lookuphtab1
…
F_B’():
pop T
jmp B’
B’:
if (T != B)
jmp Lookuphtab2
…
=> Prediction accuracy further improved
Pin Multithreading Support
• For instrumenting multithreaded programs:
– Pin intercepts all threading-related system calls:
• Create and start jitting a thread if a clone() is seen
– Pin provides a “thread id” for pintools to index thread-local storage
– Pin’s virtual registers are backed by a per-thread spilling area
• For writing multithreaded pintools:
– Pin cannot link libpthread into the pintool (to avoid conflicts in setting up signal handlers by two libpthreads), so:
• Pin implements a subset of libpthread itself
• Pin can also redirect libpthread calls in the pintool to the application’s libpthread
Instrumenting Multithreaded Programs
• Pin instruments multithreaded programs:
– Spilling area has to be thread local
• Create a new per-thread spilling area when a thread-create system
call (e.g., clone()) is intercepted
• How to access the per-thread spilling area?
– Steal a physical register to point to the per-thread spilling area
– x86-specific optimization:
• Initially assuming single-threaded program
– Access to the spilling area via its absolute address
• If multiple threads detected later:
– Flush the code cache
– Recompile with a physical register pointing to per-thread spilling area
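The x86-specific optimization above is a mode switch: start with one spilling area reached by absolute address, and on seeing a second thread flush the code cache and switch to per-thread areas reached through a register. A toy C++ model of that bookkeeping is sketched below. It assumes every thread, including the initial one, is reported through a single hook, and it counts code-cache flushes instead of performing them. `ToySpillManager` and `SpillAddressing` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// How generated code reaches the spilling area.
enum class SpillAddressing { Absolute, RegisterRelative };

struct ToySpillManager {
    SpillAddressing mode = SpillAddressing::Absolute;  // assume single-threaded
    std::map<int, std::vector<std::uint32_t> > areas;  // thread id -> spill slots
    int cacheFlushes = 0;

    // Called for the initial thread and whenever a thread-create system call
    // (e.g., clone()) is intercepted.
    void onThreadStart(int tid) {
        assert(areas.count(tid) == 0);
        areas[tid].assign(8, 0);                       // fresh per-thread area
        if (areas.size() > 1 && mode == SpillAddressing::Absolute) {
            ++cacheFlushes;                            // stands in for a flush
            mode = SpillAddressing::RegisterRelative;  // then recompile
        }
    }

    // Spill slot `index` of thread `tid`.
    std::uint32_t& slot(int tid, std::size_t index) {
        return areas.at(tid).at(index);
    }
};
```

In this model the flush happens once, when the second thread appears; later threads only get a new spilling area. A single-threaded run never pays for the stolen register.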
Optimizing Instrumentation Performance
Observations:
– Slowdown largely due to executing instrumentation code rather than dynamic compilation
=> Makes sense to spend more time optimizing
– Focus on optimizing simple instrumentation tools:
• Performance depends on how fast we can transition between the application and the tool
• Simple yet commonly used (e.g., basic-block profiling)
Pin Source Code Organization
• Pin source organized into generic, architecture-dependent, OS-dependent modules:
Architecture            #source files   #source lines
Generic                 87 (48%)        53595 (47%)
x86 (32-bit + 64-bit)   34 (19%)        22794 (20%)
Itanium                 34 (19%)        20474 (18%)
ARM                     27 (14%)        17933 (15%)
TOTAL                   182 (100%)      114796 (100%)
=> ~50% code shared among architectures
Pin Instrumentation Performance
Performance of basic-block counting with Pin/IA32
[Figure: bar chart of normalized execution time (%) per SPEC2K benchmark, with INT-AriMean and FP-AriMean, for the four configurations listed below]
Average slowdown                          INT     FP
Without optimization                      10.4x   3.9x
Inlining                                  7.8x    3.5x
Inlining + eflags analysis                2.8x    1.5x
Inlining + eflags analysis + scheduling   2.5x    1.4x
Comparison among Dynamic Instrumentation Tools
Performance of basic-block counting with three different tools
[Figure: bar chart of normalized execution time (%) per SPECINT benchmark, with INT-AriMean, for Valgrind, DynamoRIO, and Pin/IA32]
• Valgrind is a popular instrumentation tool on Linux
– Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in dynamic optimization
– Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation
Pin/IA32 Performance (no instrumentation)
[Figure: stacked bar chart of normalized execution time (%) per SPEC2K benchmark, with INT-AriMean and FP-AriMean; components: Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, VM; bar labels give the Total]
Pin/EM64T Performance (no instrumentation)
[Figure: stacked bar chart of normalized execution time (%) per SPEC2K benchmark, with INT-AriMean and FP-AriMean; components: Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, VM; bar labels give the Total]
Pin/IPF Performance (no instrumentation)
[Figure: bar chart of normalized execution time (%) per SPEC2K benchmark, with INT-AriMean and FP-AriMean; bars show the Total]