Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels
Ariel Tamches
Barton P. Miller
{tamches,bart}@cs.wisc.edu
Computer Sciences Department
University of Wisconsin
1210 W. Dayton Street
Madison, WI 53706-1685
USA
© 1999 Ariel Tamches
February 19, 1999
OSDI ‘99
The Vision
A unified infrastructure for dynamic OSes
Fine-grained runtime code instrumentation for:
– Performance measurement
– Tracing
– Testing (e.g., code coverage)
– Debugging: conditional breaks, access checks
– Optimizations: specialization, code reorganization
– Extensibility
Motivation: Measurement
• Measurement primitives
– Counts, cycle timers, cache misses, branch misprediction time (on-chip counters)
• Instrument kernel to self-measure as it runs
• Predicates
– A specific code path; when a process is running, etc.
• Many interesting routines in the kernel:
– Scheduling: preempt, disp, swtch
– VM management: hat_chgprot, hat_swapin
– Network: tcp_lookup, tcp_wput, ip_csum_hdr, hmeintr
Motivation: Optimization
• Performance measurement shows slow code? Pick from a cookbook of on-line optimizations
– Specialization
• Instrument function to find common params
• Generate specialized function
• Install (old version jumps to new if condition met)
• Can predicate specialization (e.g. a specific process)
– Reorganize code to improve i-cache
• Instrument function to measure icache miss cycles
• Then instrument to find cold basic blocks
• Generate “outlined” function & install
Technology to Make it Happen
KernInst: fine-grained dynamic kernel instrumentation
• Inserts runtime-generated code into kernel
• Dynamic: everything at runtime
– no recompile, reboot, or even pause
• Fine-grained: insert at instruction granularity
• Runs on unmodified commodity kernel
– Solaris on UltraSPARC
Dynamic Instrumentation
[Figure: some_kernel_func() contains instruc1, instruc2, a branch (in place of instruc3), instruc4, ..., instruc19, instruc20. The branch leads to a code patch holding the runtime-generated code, the equivalent of instruc3, and a branch back.]
Net effect: desired code is inserted before instruc3
• Insert any code, almost anywhere (fine-grained), entirely at runtime (dynamic)
Our System: KernInst
[Figure: KernInst tools (kernel profiler, tracer, optimizer, ...) send instrumentation requests to kerninstd. kerninstd reaches kernel space through ioctl() calls on /dev/kerninst. The patch heap and the data heap live in kernel space.]
How KernInst Works
kerninstd startup:
– Installs the KernInst driver, /dev/kerninst
– Allocates patch and data heaps, and reads kernel symbol table (with assistance from /dev/kerninst)
– Parses kernel code into CFG
• Tools may want to identify basic blocks
– Finds unused registers
• Inserted code will use these registers (avoid spills)
• From an interprocedural data-flow analysis on the CFG
– 15 seconds
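
The unused-register search can be pictured as a backward live-register analysis over the CFG. The sketch below is a simplified, single-function version on a made-up three-block CFG; the block names, register sets, and the `free_registers` helper are illustrative assumptions, not kerninstd's data structures (the slide describes an interprocedural analysis).

```python
# Hypothetical CFG: each block lists the registers it reads before writing
# ("use"), the registers it writes ("defs"), and its successor blocks.
CFG = {
    "entry": {"use": {"o0"}, "defs": {"l0"}, "succ": ["loop"]},
    "loop": {"use": {"l0", "o1"}, "defs": {"l1"}, "succ": ["loop", "exit"]},
    "exit": {"use": {"l1"}, "defs": {"o0"}, "succ": []},
}
ALL_REGS = {"o0", "o1", "l0", "l1", "l2", "l3"}


def live_in_sets(cfg):
    """Iterate live_in = use | (live_out - defs) until nothing changes."""
    live_in = {name: set() for name in cfg}
    changed = True
    while changed:
        changed = False
        for name, block in cfg.items():
            live_out = set()
            for succ in block["succ"]:
                live_out |= live_in[succ]
            new_in = block["use"] | (live_out - block["defs"])
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


def free_registers(cfg, block_name, all_regs=ALL_REGS):
    """Registers not live on entry to a block; inserted code may clobber them."""
    return all_regs - live_in_sets(cfg)[block_name]
```

Code inserted at the top of `loop` could use any register returned by `free_registers(CFG, "loop")` without saving it first, which is the point of the "avoid spills" bullet.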
How KernInst Works (2)
• To splice in instrumentation code, kerninstd:
– Allocates code patch
– Fills code patch with instrumentation code, overwritten instruction, and a jump back
– Overwrites instruction at instrumentation point with a branch to the code patch
• Writing to kernel memory
– /dev/kmem works for most of the kernel
– Have /dev/kerninst map into D-TLB for nucleus
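
The three splicing steps above can be mimicked on a toy instruction list. This is only a model: instructions are strings, branches are tuples, and `splice`, `run`, and the patch heap list are hypothetical names, not the kerninstd interface.

```python
def splice(kernel, point, instrumentation, patch_heap):
    """Insert `instrumentation` before kernel[point]; return the patch address."""
    patch_addr = len(patch_heap)                 # 1. allocate a code patch
    patch_heap.extend(instrumentation)           # 2. fill it: instrumentation,
    patch_heap.append(kernel[point])             #    the overwritten instruction,
    patch_heap.append(("to_kernel", point + 1))  #    and a jump back
    kernel[point] = ("to_patch", patch_addr)     # 3. overwrite one instruction
    return patch_addr


def run(kernel, patch_heap):
    """Execute the toy function and return the instructions that ran."""
    trace, code, pc = [], kernel, 0
    while pc < len(code):
        insn = code[pc]
        if isinstance(insn, tuple):              # a branch: switch code region
            kind, target = insn
            code = patch_heap if kind == "to_patch" else kernel
            pc = target
        else:
            trace.append(insn)
            pc += 1
    return trace


kernel = ["instruc1", "instruc2", "instruc3", "instruc4"]
patch_heap = []
splice(kernel, 2, ["count++"], patch_heap)
trace = run(kernel, patch_heap)
```

After the splice, `trace` shows `count++` running just before `instruc3`, while the kernel function itself changed in exactly one slot.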
Code Splicing Hazard
Jumping to the patch using two instructions:
[Figure: Before the splice, the code is ORIG 1, ORIG 2, and a kernel thread is executing between them. After the splice, the code is NEW 1, NEW 2, and the thread is (still) executing at the same place. Crash!]
Execution sequence: (ORIG1, NEW2)
• Cannot pause kernel to check for hazard
• Splicing must replace only one instruction!
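
A small model makes the hazard concrete. It assumes a thread that has already executed everything before its program counter when the overwrite lands; `executed_sequence` and `is_torn` are illustrative helpers, not part of KernInst.

```python
def executed_sequence(code, pc, new_insns, at=0):
    """Instructions seen by a thread that ran code[:pc] before the splice and
    the rest after new_insns overwrote code[at:at + len(new_insns)]."""
    patched = list(code)
    patched[at:at + len(new_insns)] = new_insns
    return code[:pc] + patched[pc:]


def is_torn(code, pc, new_insns, at=0):
    """True if the thread executes some, but not all, of the new instructions."""
    window = executed_sequence(code, pc, new_insns, at)[at:at + len(new_insns)]
    fresh = sum(1 for insn in window if insn in new_insns)
    return 0 < fresh < len(new_insns)


code = ["ORIG1", "ORIG2", "ORIG3"]
two_insn = executed_sequence(code, pc=1, new_insns=["NEW1", "NEW2"])
one_insn = executed_sequence(code, pc=1, new_insns=["NEW1"])
```

With a two-instruction overwrite the thread sees the mixed sequence from the slide; with a one-instruction overwrite it sees either the old instruction or the new one, never half of a pair.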
Code Splicing: Reach Problem
• Tough to reach patch with just 1 instruction!
– Usually too far from the instrumentation point.
– SPARC branch instruction has only +/- 8MB offset (ba,a <offset>)
• General solution: springboards
[Figure: the instrumentation point holds a single branch to a nearby springboard; the springboard makes a long jump (using as many insns as needed) to the code patch, which is built as usual.]
Springboard Heap
• Any scratch space located close to the splice point is suitable for a springboard
– Must be reachable by the 1-instruction branch
– Kernel modules have initialization and termination routines that can be overwritten
• _init and _fini on SVR4
• Ideal because located throughout the kernel
• Lock modules into memory for safety
– Other unused space in the kernel:
• _start and main are only used when booting
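
The reach test and springboard choice can be sketched as follows. The sketch assumes the branch carries a signed 22-bit displacement counted in 4-byte words, which matches the +/- 8MB figure quoted earlier; the addresses and the `plan_jump` helper are made up for illustration.

```python
WORD = 4        # SPARC instructions are 4 bytes wide
DISP_BITS = 22  # assumed signed word displacement, i.e. roughly +/- 8MB


def reachable(src, dst):
    """Can a single branch located at src transfer control to dst?"""
    disp_words, rem = divmod(dst - src, WORD)
    limit = 1 << (DISP_BITS - 1)
    return rem == 0 and -limit <= disp_words < limit


def plan_jump(splice_point, patch_addr, springboards):
    """Return the hops needed: the patch alone if it is in reach, otherwise
    the first reachable springboard followed by the patch, else None."""
    if reachable(splice_point, patch_addr):
        return [patch_addr]
    for sb in springboards:
        if reachable(splice_point, sb):
            return [sb, patch_addr]
    return None


# Hypothetical addresses: the patch is 512MB away, one springboard is 4MB away.
hops = plan_jump(0x10000000, 0x30000000, [0x20000000, 0x10400000])
```

Because springboard space such as _init and _fini is scattered throughout the kernel, a search like this is likely to find a candidate near most splice points.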
Web Proxy Server Measurement
• Simple kernel measurement tool
– Number of calls made to a kernel function
– Number of kernel threads executing within a kernel function (“concurrency”)
• Squid v1.1.22 http proxy server
– Caches HTTP objects in memory and on disk
– We used KernInst to understand the cause of two Squid disk I/O bottlenecks.
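
As a user-level analogy for the two metrics, the sketch below wraps a function with entry and exit hooks. It is not how KernInst inserts code (that happens by splicing into the running kernel); `FunctionProfile` and `instrument` are hypothetical names.

```python
import threading


class FunctionProfile:
    """Call count plus the number of threads currently inside the function."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0
        self.active = 0      # current "concurrency"
        self.max_active = 0  # highest concurrency observed

    def on_entry(self):
        with self._lock:
            self.calls += 1
            self.active += 1
            self.max_active = max(self.max_active, self.active)

    def on_exit(self):
        with self._lock:
            self.active -= 1


def instrument(func, profile):
    """Return a wrapper that runs the entry and exit hooks around func."""
    def wrapper(*args, **kwargs):
        profile.on_entry()
        try:
            return func(*args, **kwargs)
        finally:
            profile.on_exit()
    return wrapper
```

Sampling `calls` over time gives a call rate, and sampling `active` shows how many threads are inside the function at once.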
Web Proxy Server Measurement
• Profile of the kernel open() routine
• Called 20-25 times/sec; taking 40% of time!
• open() calls vn_create, which has 2 sub-bottlenecks:
– lookuppn (a.k.a. namei): path name translation (20%)
– ufs_create: file create on local disk (20%)
File Creation Bottleneck
• How Squid manages its on-disk cache:
– 1 file per cached HTTP object
– A fixed hierarchy of cache files
– Stale cache files overwritten
• File creation bottleneck
– Overwriting existing files: truncates first
– UFS semantics: meta-data changed synchronously
File Creation Optimization
• Overwrite cache file; truncate only if needed
• What took 20% now takes 6%
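
One way to picture "truncate only if needed" at user level: open the file without O_TRUNC, write over the old contents, and shrink the file only when the new object is shorter. This is a sketch of the idea, not the actual change made to Squid; `overwrite_cache_file` is a hypothetical helper.

```python
import os


def overwrite_cache_file(path, data):
    """Overwrite a file in place; truncate only if the new data is shorter."""
    # No O_TRUNC here, so an existing file keeps its blocks while being rewritten.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        # Only pay for a truncate when stale bytes would otherwise remain.
        if os.fstat(fd).st_size > len(data):
            os.ftruncate(fd, len(data))
    finally:
        os.close(fd)
```

When the replacement object is at least as large as the stale one, no truncate is issued at all, which avoids the truncation step named on the previous slide.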
Conclusion
Fine-grained dynamic kernel instrumentation is feasible on an unmodified commodity OS
A single infrastructure for
– Profiling, debugging, code coverage
– Optimizations
– Extensibility
The foundation for an evolving OS
Measures and constantly adapts itself to runtime usage patterns
The Big Picture
http://www.cs.wisc.edu/paradyn
(Backup slides follow)
Time Spent Demuxing TCP Packets
[Figure: tcp_lookup() is instrumented with two patches in the patch area. One starts the timer and then runs the displaced code; the other stops the timer and then runs the displaced code. The timer, time_tcp_lookup, lives in the data area.]
Kernel Metrics
• Number of preemptions
[Figure: preempt() in the kernel code branches to the code patch area, where the patch executes num_preempt++ and then the displaced code. The counter num_preempt lives in the timers & counters area.]
Kernel Metrics
• Number of preemptions of process foo
[Figure: preempt() in the kernel code branches to the code patch area, where the patch executes "if curthread->t_procp == foo, num_preempt++" and then the displaced code. The counter num_preempt lives in the timers & counters area.]
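
The counter patch and its predicated variant can be modeled in a few lines. The `Thread` class stands in for the kernel thread structure, and `make_patch` is a hypothetical helper; only the names num_preempt and t_procp come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    t_procp: str  # stand-in for curthread->t_procp


counters = {"num_preempt": 0, "num_preempt_foo": 0}


def make_patch(counter, predicate=None):
    """Build the code that would run in the patch before the displaced code."""
    def patch(curthread):
        if predicate is None or predicate(curthread):
            counters[counter] += 1
    return patch


count_all = make_patch("num_preempt")
count_foo = make_patch("num_preempt_foo", lambda t: t.t_procp == "foo")

# Simulate three preemptions, two of which hit process foo.
for proc in ["foo", "bar", "foo"]:
    thread = Thread(t_procp=proc)
    count_all(thread)
    count_foo(thread)
```

The predicate costs one comparison per preemption and keeps the counter limited to the process of interest.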
Kernel Metrics
• Time spent filtering incoming TCP packets
[Figure: tcp_lookup() in the kernel code branches to two patches in the code patch area. One starts the timer and then runs the displaced code; the other stops the timer and then runs the displaced code. The timer time_tcp_lookup lives in the timers & counters area.]
Kernel Metrics
• Virtual time spent allocating kernel memory
[Figure: kmem_alloc() is instrumented at two points: "if in_proc_P, start timer" and "if in_proc_P, stop timer", each followed by the displaced code. swtch() is instrumented with "if leaving P, stop timer; if starting P, start timer", followed by the displaced code. All patches share one timer.]
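
A virtual timer of this kind can be replayed over an event trace. The sketch below is a model under stated assumptions: timestamps are plain integers, the event format is invented, and the `in_func` flag (restart the timer on a switch back to P only if P was inside the function) is an added assumption that the slide does not spell out.

```python
class VirtualTimer:
    def __init__(self):
        self.total = 0
        self._started_at = None

    def start(self, now):
        if self._started_at is None:
            self._started_at = now

    def stop(self, now):
        if self._started_at is not None:
            self.total += now - self._started_at
            self._started_at = None


def replay(events, proc="P"):
    """events: (time, kind, arg) with kind in {"enter", "exit", "swtch"}."""
    timer, current, in_func = VirtualTimer(), None, False
    for now, kind, arg in events:
        if kind == "enter" and current == proc:   # kmem_alloc entry patch
            in_func = True
            timer.start(now)
        elif kind == "exit" and current == proc:  # kmem_alloc exit patch
            timer.stop(now)
            in_func = False
        elif kind == "swtch":                     # swtch patch
            if current == proc:
                timer.stop(now)                   # leaving P
            current = arg
            if current == proc and in_func:
                timer.start(now)                  # starting P
    return timer.total


events = [(0, "swtch", "P"), (10, "enter", None), (15, "swtch", "Q"),
          (40, "swtch", "P"), (45, "exit", None)]
virtual_time = replay(events)
```

In this trace the call spans 35 time units of wall time, but only the 10 units during which P was actually running are charged to it.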
Example: Specialization
• Profile:
[Figure: kmem_alloc() branches to a patch that gets the size parameter, executes numcalls[size]++, and then runs the displaced code. numcalls[] is a hash table.]
• Decision: examine hash table
• Generate specialized version:
– choose fixed value & run constant propagation
– expect unconditional branches & dead code
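
The profile and decision steps might look like the sketch below. The histogram mirrors numcalls[] from the figure; the 50% threshold and the `pick_specialization` helper are assumptions chosen for illustration, since the slides do not say how the decision is made.

```python
from collections import Counter

numcalls = Counter()  # plays the role of the numcalls[] hash table


def profile_call(size):
    """What the profiling patch does on every kmem_alloc(size) call."""
    numcalls[size] += 1


def pick_specialization(hist, min_share=0.5):
    """Return the dominant parameter value, or None if no value dominates."""
    total = sum(hist.values())
    if total == 0:
        return None
    value, count = hist.most_common(1)[0]
    return value if count / total >= min_share else None


# Made-up workload: most requests ask for 32 bytes.
for size in [32] * 8 + [64, 128]:
    profile_call(size)
chosen = pick_specialization(numcalls)
```

Here `chosen` would be the fixed value fed to constant propagation when generating the specialized version.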
Example: Specialization
• Splice in the specialized version:
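[Figure: kmem_alloc() branches to a patch that tests "if size==value"; if so, control goes to the specialized version, otherwise the displaced code runs and the original function continues.]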
• Patch calls to kmem_alloc
– Detect constant values for size, where possible
– If specialized version appropriate, patch call
• No overhead in this case
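
The guard and the call-site patching can be modeled with ordinary functions. `generic_alloc` is a toy stand-in for kmem_alloc, and `specialize` fakes constant propagation by precomputing the result of a pure function; all names here are hypothetical.

```python
def generic_alloc(size):
    """Toy stand-in for kmem_alloc: round the request up to a power of two."""
    bucket = 8
    while bucket < size:
        bucket *= 2
    return bucket


def specialize(func, value):
    """Fold the fixed argument in ahead of time (valid only for pure func)."""
    result = func(value)
    return lambda: result


def install(generic, value, specialized):
    """Guarded entry: the old version jumps to the new one if the condition holds."""
    def patched(size):
        if size == value:
            return specialized()
        return generic(size)
    return patched


def patch_call_site(constant_size, value, specialized, patched):
    """If the call site passes a known constant equal to value, bind it
    straight to the specialized version and skip the guard entirely."""
    if constant_size == value:
        return specialized
    return lambda: patched(constant_size)


fast_32 = specialize(generic_alloc, 32)
alloc = install(generic_alloc, 32, fast_32)
site = patch_call_site(32, 32, fast_32, alloc)
```

A call through `alloc` still pays for the size comparison, whereas a patched call site such as `site` does not, which is the "no overhead" case.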