Rapid Identification of Architectural Bottlenecks via Precise Event Counting John Demme, Simha Sethumadhavan Columbia University {jdd,simha}@cs.columbia.edu.

Rapid Identification of Architectural
Bottlenecks via Precise Event Counting
John Demme, Simha Sethumadhavan
Columbia University
{jdd,simha}@cs.columbia.edu
Language Popularity, 2002
Languages shown: Objective-C, Scheme, C#, Lisp, Python, Delphi, Javascript, Other, Java, PHP, Perl, C, Visual Basic, C++
Chart annotation: Platforms
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index
CASTL: Computer Architecture and
Security Technologies Lab
Language Popularity, 2011
Languages shown: Go, Other, Lua, Java, Ruby, Objective-C, C#, C, Scheme, Ada, Lisp, Python, Delphi, C++, PHP, Javascript, Perl, Visual Basic
Chart annotations: Platforms, Moore’s Law, Multicore
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index
HOW CAN WE POSSIBLY KEEP UP?
Architectural Lifecycle
Stages (cycle diagram): Performance Data Collection, Human Analysis, Architectural Improvement
Performance Data Collection
• Analytical Models
– Fast, but questionable accuracy
• Simulation
– Often the gold standard
– Very detailed information
– Very slow
• Production Hardware (performance counters)
– Very fast
– Not very detailed
Performance Data Collection (revisited)
• Production Hardware (Performance Counters)
– Very fast
– Relatively detailed (revising “not very detailed”)
ACCURACY, PRECISION & PERTURBATION
A comparison of performance monitoring techniques and the uncertainty principle
Accuracy, Precision & Perturbation
Normal Program Execution
Corresponding Machine State (Cache, Branch Predictor, etc)
Time
• In normal execution, the program interacts with the microarchitecture as expected
Precise Instrumentation
Monitored Program Execution: Start of mutex_lock, Start of mutex_unlock, Start of barrier_wait
Measured Machine State (Cache, Branch Predictor, etc)
“Correct” Machine State (Cache, Branch Predictor, etc)
Time
• When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate
Performance Counter SW Landscape
Precise (heavyweight): reads counters whenever program or instrumentation requests a read
• Examples: PAPI, perf_event
• Overhead: proportional to # of reads (PAPI: 1048ns, perf_event: 262ns)
Sampling vs. Instrumentation
Traditional Instrumented Program Execution: Start of mutex_lock, Start of mutex_unlock, Start of barrier_wait
Sampled Program Execution: interrupt every n cycles
Time
• Traditional instrumentation is like polling
• Sampling uses interrupts
Performance Counter SW Landscape
Sampling: interrupts every n cycles and extrapolates
• Examples: vTune, OProfile
• Overhead: inversely proportional to n; up to 20%, usually much less
Precise (heavyweight): reads counters whenever program or instrumentation requests a read
• Examples: PAPI, perf_event
• Overhead: proportional to # of reads (PAPI: 1048ns, perf_event: 262ns)
The Problem with Sampling
Conditional Locks
    if (info->s->concurrent_insert)
      rw_rdlock(&info->s->key_root_lock[inx]);
    changed=_mi_test_if_changed(info);
    if (!flag) {
      switch(info->s->keyinfo[inx].key_alg) {
      /* 37 lines omitted */
    }
    if (info->s->concurrent_insert) {
      if (!error) {
        while (...) {
          /* 10 lines omitted */
        }
      }
      rw_unlock(&info->s->key_root_lock[inx]);
    }
Sample Interrupt: Is this a critical section?
Corrected with Precision
Conditional Locks (same code as on the previous slide)
Read counter
Read counter
But, Precision Adds Overhead
Monitored Program Execution
Measured Machine State (Cache, Branch Predictor, etc)
“Correct” Machine State (Cache, Branch Predictor, etc)
Time
Instrumentation Adds Perturbation
Monitored Program Execution
Measured Machine State (Cache, Branch Predictor, etc)
“Correct” Machine State (Cache, Branch Predictor, etc)
Time
• If instrumentation sections are short, perturbation is reduced and measurements become more accurate
Performance Counter SW Landscape
Sampling: interrupts every n cycles and extrapolates
• Examples: vTune, OProfile
• Overhead: inversely proportional to n; up to 20%, usually much less
Precise (heavyweight): reads counters whenever program or instrumentation requests a read
• Examples: PAPI, perf_event
• Overhead: proportional to # of reads (PAPI: 1048ns, perf_event: 262ns)
Precise (lightweight): no entries yet
Performance Counter SW Landscape
Sampling: interrupts every n cycles and extrapolates
• Examples: vTune, OProfile
• Overhead: inversely proportional to n; up to 20%, usually much less
Precise (heavyweight): reads counters whenever program or instrumentation requests a read
• Examples: PAPI, perf_event
• Overhead: proportional to # of reads (PAPI: 1048ns, perf_event: 262ns)
Precise (lightweight)
• Examples: LiMiT
• Overhead: proportional to # of reads (11ns)
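The two overhead rules in the landscape slide can be put into rough numbers. The sketch below is illustrative only: the per-read costs (1048 ns, 262 ns, 11 ns) are the ones quoted on the slide, while the read rate, the interrupt cost and the function names are hypothetical values chosen for the example.

```python
# Per-read costs in nanoseconds, as quoted on the slide.
READ_COST_NS = {"PAPI": 1048, "perf_event": 262, "LiMiT": 11}


def precise_overhead(tool: str, reads_per_second: float) -> float:
    """Fraction of each second spent reading counters (proportional to # of reads)."""
    return reads_per_second * READ_COST_NS[tool] * 1e-9


def sampling_overhead(interrupt_cost_cycles: float, n_cycles: float) -> float:
    """Fraction of cycles spent in sample interrupts (inversely proportional to n)."""
    return interrupt_cost_cycles / n_cycles


if __name__ == "__main__":
    # Hypothetical workload: 500,000 counter reads per second.
    for tool in READ_COST_NS:
        print(f"{tool}: {precise_overhead(tool, 500_000):.1%} of run time")
    # Hypothetical interrupt cost of 2,000 cycles; doubling n halves the overhead.
    print(sampling_overhead(2_000, 100_000), sampling_overhead(2_000, 200_000))
```

Under these assumed numbers, heavyweight precise reads consume a large share of the run time, while the same read rate with an 11 ns read stays below one percent.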
Related Work
• No recent papers for better precise counting
– Original PAPI paper: Browne et al. 2000
– Some software, none offering LiMiT’s features
• Characterizing performance counters
– Weaver & Dongarra 2010
• Sampling
– Counter multiplexing techniques
• Mytkowicz et al. 2007
• Azimi et al. 2005
– Trace Alignment
• Mytkowicz et al. 2006
REDUCING COUNTER READ OVERHEADS
Implementing lightweight, precise monitoring
Why Precision is Slow
Perfmon2 & Perf_event: Program requests counter read → Kernel reads counter and returns result → Program uses value
LiMiT: Program reads counter → Program uses value
Avoid system calls to avoid overhead
Why is this so hard?
A Self-Monitoring Process
Run, process, run
Counters: L1 Misses, Branches, Cycles (values changing as the process runs)
Overflow
Counters: L1 Misses 7, Branches 24, Cycles 100
Psst!
Overflow
Counters: L1 Misses 7, Branches 24, Cycles 100 (overflows to 00)
Overflow Space: L1 Misses 0, Branches 0, Cycles 0 → 100
Modified Read
20 + 100 = 120
Counters: L1 Misses 7, Branches 24, Cycles 20
Overflow Space: L1 Misses 0, Branches 0, Cycles 100
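The modified read on this slide adds the value kept in the overflow space to the live hardware counter. A minimal simulation follows, assuming a toy two-digit counter that wraps at 100 as in the slides; the class and method names are invented for illustration and are not LiMiT's interface.

```python
class VirtualCounter:
    """Toy model: a narrow hardware counter plus a software overflow space."""

    def __init__(self, limit: int = 100):
        self.limit = limit    # the hardware counter overflows at this value
        self.hw = 0           # live hardware counter
        self.overflow = 0     # overflow space maintained by the kernel

    def tick(self, events: int = 1) -> None:
        """Count events; on overflow the kernel folds the count into the overflow space."""
        for _ in range(events):
            self.hw += 1
            if self.hw >= self.limit:
                self.overflow += self.hw
                self.hw = 0

    def read(self) -> int:
        """Modified read: hardware value plus accumulated overflow."""
        return self.hw + self.overflow


counter = VirtualCounter()
counter.tick(120)
print(counter.hw, counter.overflow, counter.read())  # 20 100 120
```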
Overflow During Read
Value read so far: 99
Counters: L1 Misses 7, Branches 24, Cycles 99
Overflow Space: L1 Misses 0, Branches 0, Cycles 0
Overflow!
Value read so far: 99
Counters: L1 Misses 7, Branches 24, Cycles 100 (overflows to 00)
Overflow Space: L1 Misses 0, Branches 0, Cycles 0 → 100
Atomicity Violation!
99 + 100 = 199
Counters: L1 Misses 7, Branches 24, Cycles 0
Overflow Space: L1 Misses 0, Branches 0, Cycles 100
OS Detection & Correction
Value read so far: 99
Counters: L1 Misses 7, Branches 24, Cycles 100 (overflows to 00)
Overflow Space: L1 Misses 0, Branches 0, Cycles 0 → 100
OS Detection & Correction
“Looks like he was reading that…”
Value read so far: 99 → 0
Counters: L1 Misses 7, Branches 24, Cycles 00
Overflow Space: L1 Misses 0, Branches 0, Cycles 100
Atomicity Violation Corrected
0 + 100 = 100
Counters: L1 Misses 7, Branches 24, Cycles 0
Overflow Space: L1 Misses 0, Branches 0, Cycles 100
So what does all this effort buy us?
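The sequence in the preceding slides (a read interrupted by an overflow, then corrected by the OS) can be replayed in a few lines. This is a simplified sketch using the toy two-digit counter from the slides; the function name and the `in_read` flag are hypothetical stand-ins for the kernel's inspection of the interrupted instructions.

```python
LIMIT = 100  # toy two-digit counter, as in the slides


def read_with_overflow_in_between(kernel_fixup: bool) -> int:
    """Replay a counter read that is interrupted by an overflow."""
    hw, ovfl = 99, 0

    # Step 1 of the read: copy the hardware counter into a register.
    reg = hw
    in_read = True  # the program sits between the two steps of the read

    # One more event arrives: the counter overflows and the kernel takes over.
    hw += 1
    if hw >= LIMIT:
        ovfl += hw
        hw = 0
        if kernel_fixup and in_read:
            # The kernel notices the interrupted read and resets the register.
            reg = 0

    # Step 2 of the read: add the overflow space.
    return reg + ovfl


print(read_with_overflow_in_between(kernel_fixup=False))  # 199
print(read_with_overflow_in_between(kernel_fixup=True))   # 100
```

Without the fixup the 99 events already copied into the register are counted twice; with it, the read returns the true total of 100.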
Time to collect 3×10⁷ readings
Time     PAPI     Perf_event   LiMiT    Speedup
User     1.26s    0.53s        0.34s    3.7x / 1.56x
System   30.10s   7.30s        0        ∞
Wall     31.44s   7.87s        0.34s    92x / 23.1x
Average LiMiT Readout
Number of instructions   5
Number of cycles         37.14
Time                     11.3 ns
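The per-read costs quoted earlier in the talk follow from the wall-clock row of this table divided by the number of readings. A short check, using only numbers from the slides (the variable and function names are made up for the example):

```python
READINGS = 3e7  # number of counter reads in the experiment

# Wall-clock seconds from the table above.
WALL_SECONDS = {"PAPI": 31.44, "perf_event": 7.87, "LiMiT": 0.34}


def ns_per_read(wall_seconds: float, readings: float = READINGS) -> float:
    """Average cost of one counter read in nanoseconds."""
    return wall_seconds / readings * 1e9


for tool, wall in WALL_SECONDS.items():
    print(f"{tool}: {ns_per_read(wall):.1f} ns per read")

# Clock rate implied by the average LiMiT readout (37.14 cycles in 11.3 ns).
print(f"{37.14 / 11.3:.2f} GHz")
```

The results round to 1048 ns, 262 ns and 11.3 ns, matching the figures given for PAPI, perf_event and LiMiT.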
LiMiT Enables Detailed Study
• Short counter reads decrease perturbation
• Little perturbation allows detailed study of
– Short synchronization regions
– Short function calls
• Three Case Studies
– Synchronization in production web applications
• Not presented here, see paper
– Synchronization changes in MySQL over time
– User/Kernel code behavior in runtime libraries
CASE STUDY: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL
Has MySQL gotten better since the advent of multi-cores?
Evolution of Locking in MySQL
• Questions to answer
– Has MySQL gotten better at locking?
– What techniques have been used?
• Methodology
– Intercept pthread locking calls
– Count overheads and critical sections
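The methodology above (intercept locking calls, then count overheads and critical sections) can be sketched as a wrapper around a lock. This is an illustration only: the study intercepts pthread calls and reads hardware counters with LiMiT, whereas the sketch wraps Python's `threading.Lock` and uses a pluggable clock as a stand-in for a counter read. `TimedLock` and its fields are hypothetical names.

```python
import threading
import time


class TimedLock:
    """Wraps a lock and accumulates time spent locking, holding and unlocking."""

    def __init__(self, clock=time.perf_counter_ns):
        self._lock = threading.Lock()
        self._stats_lock = threading.Lock()  # protects the update made after release
        self._clock = clock                  # stand-in for a cheap counter read
        self._acquired_at = 0
        self.stats = {"locking": 0, "held": 0, "unlocking": 0, "acquisitions": 0}

    def __enter__(self):
        start = self._clock()
        self._lock.acquire()
        # From here until release, this thread owns the lock and these fields.
        self._acquired_at = self._clock()
        self.stats["locking"] += self._acquired_at - start
        self.stats["acquisitions"] += 1
        return self

    def __exit__(self, *exc):
        release_start = self._clock()
        self.stats["held"] += release_start - self._acquired_at
        self._lock.release()
        elapsed = self._clock() - release_start
        with self._stats_lock:
            self.stats["unlocking"] += elapsed
        return False


# Deterministic demo: a fake clock that advances 10 units per read.
ticks = iter(range(0, 1000, 10))
lock = TimedLock(clock=lambda: next(ticks))
with lock:
    pass
print(lock.stats)
```

Each acquire/release pair costs four clock reads here, which is why the cost of a single read matters so much for short critical sections.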
MySQL Synchronization Times
(Chart. Y axis: Percentage of Execution, 0% to 100%. Series: Free, Locking, Lock Held, Unlocking. X axis: MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), MySQL 5.5 (Beta, 2009).)
MySQL Critical Sections
(Chart. Series: Overall Time With Lock Held, Avg. Lock Hold Time. Axes: Percentage of Execution with Lock Held, 0% to 45%; Average Number of Cycles Lock is Held, 0 to 1400. X axis: MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), MySQL 5.5 (Beta, 2009).)
Number of Locks in MySQL
(Chart. Series: Static Locks, Dynamic Locks, each on its own axis; axis ranges 0 to 4.E+05 and 0 to 6.E+08. X axis: MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), MySQL 5.5 (Beta, 2009).)
Observations & Implications
• Coarser granularity, better performance
– Total critical section time has decreased
– Average CS times have increased
– Number of locks has decreased
• Performance counters useful for software engineering studies
CASE STUDY: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY
Does code in the kernel and runtime library behave?
Full System Analysis w/o Simulation
• Questions to answer
– How much time do system applications spend in runtime libraries?
– How well do they perform in them? Why?
• Methodology
– Intercept common libc, libm and libpthread calls
– Count user-/kernel- space events during the calls
– Break down by purpose (I/O, Memory, Pthread)
• Applications
– MySQL, Apache
• Intel Nehalem Microarchitecture
Execution Cycles in Library Calls
(Chart. Y axis: Percentage of Total Cycles, 0% to 50%. Series: Pthreads, Memory, I/O. X axis: MySQL (User), MySQL (Kernel), Apache (User), Apache (Kernel).)
MySQL Clocks per Instruction
(Chart. Y axis: Clocks per Instruction, 0 to 2. X axis labels: User, Kernel, Libc, Program.)
L3 Cache MPKI
(Chart. Y axis: L3 MPKI, with two scales, 0 to 35 and 0 to 2. Series: I/O, Memory, Pthreads. X axis: MySQL (User), MySQL (Kernel), Apache (User), Apache (Kernel).)
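The metrics plotted in these charts are simple ratios of raw counter values: clocks per instruction (CPI) is cycles divided by instructions, and MPKI is misses per thousand instructions. A small sketch, in which the raw counts are hypothetical and not measurements from the study:

```python
def cpi(cycles: int, instructions: int) -> float:
    """Clocks per instruction."""
    return cycles / instructions


def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand (kilo) instructions."""
    return misses * 1000 / instructions


# Hypothetical raw counts gathered around one intercepted library call.
sample = {"cycles": 1_800_000, "instructions": 1_000_000, "l3_misses": 1_500}

print(cpi(sample["cycles"], sample["instructions"]))      # 1.8
print(mpki(sample["l3_misses"], sample["instructions"]))  # 1.5
```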
I-Cache Stall Cycles
(Chart. Y axis: Percentage of Total Cycles, 0.0% to 3.0%; off-scale bars labeled 22.4% and 12.0%. Series: I/O, Memory, Pthreads. X axis: MySQL (User), MySQL (Kernel), Apache (User), Apache (Kernel).)
Observations & Implications
• Apache is fundamentally I/O bound
– Optimization of the I/O subsystem necessary
• Kernel code suffers from I-Cache stalls
– Speculation: bad interrupt instruction prefetching
• LiMiT yields detailed performance data
– Not as accurate or detailed as simulation
– But gathered in hours rather than weeks
CONCLUSIONS
Research Methodology Implications, Closing thoughts
Conclusions
• Implications from case studies
– MySQL’s multicore experience helped scalability
– Performance counting for non-architecture
– Libraries and kernels perform very differently
– I/O subsystems can be slow
• Research Methodology
– LiMiT can provide detailed results quickly
– Simulators are more detailed but slow
– Opportunity to build microbenchmarks
• Identify bottlenecks with counters
• Verify representativeness with counters
• Then simulate
QUESTIONS?
BACKUP SLIDES
Man down! Need backup!
Performance Evaluation Methods
                      Accuracy   Precision   Speed   Cost
Simulators            ↑          ↑           ↓       ↑/↓
Analytical Models     ?          ?           ↑       ↓
Prototype Hardware    ↑          ↑           ↑       ↑
Production Hardware   ↑/↓        ↑/↓         ↑       ↓
Accuracy and Precision are traded off
• Production hardware provides performance counters
• However, existing interfaces make accuracy/precision tradeoff difficult
Sampling vs. LiMiT
LiMiT Instrumented Program Execution: Start of mutex_lock, Start of mutex_unlock, Start of barrier_wait
Sampled Program Execution: interrupt every n cycles
Another process runs
Counters: Miles, Pushups, Situps (values shown: 75, 9, 24, 39)
Fix: Virtualization
“30 Miles! I did pretty well today.”
“No you didn’t.”
Counters: Miles, Pushups, Situps (values shown: 30, 7, 24, 39)
Avoiding Communication
Counter set 1: Miles 30, Pushups 0, Situps 0
Counter set 2: Miles 7, Pushups 24, Situps 39
LiMiT Operation
Program Execution
• Counter reading code:
      mov  $0, %ecx
      rdpmc
      shl  $32, %rdx
      orq  %rax, %rdx
      addq ovfl, %rdx
Kernel Scheduling (Timer Interrupt Handler)
• Timer interrupt: transition to kernel
• PMC0 < 2³¹: regular timer interrupt processing
• PMC0 >= 2³¹: counter overflow!
– Kernel increments overflow variable and resets counter: ovfl += PMC0; PMC0 = 0
– Detect counter read: is the program currently executing a PMC read? Examine interrupted instructions and look for read pattern
– No: return to program
– Yes: atomicity violation! Special kernel handling required to avoid double counting. Error handler: reset %rdx, %rax before returning to program
Process Swap
• Kernel saves PMC; a different program executes
• Process swap back: kernel attempts to restore PMC
RDTSC
(Chart; title, axis labels and values are not recoverable from the extraction. X axis categories: No Resource Sharing, Core Sharing (SMT), Process Swapping.)
MySQL Instrumentation Overhead
(Chart. Y axis: MySQL Execution Cycles (User Time), 0.00E+00 to 2.50E+12. X axis: None, LiMiT, perf_event, PAPI.)
CASE STUDY A: LOCKING IN WEB WORKLOADS
How does web-related software use locks?
Locking on the Web
• Questions to answer
– Is locking a significant concern?
– How can architects help?
– Are traditional benchmarks similar?
• Methodology
– Intercept pthread mutex calls, time w/ LiMiT
• Applications
– Firefox
– Apache
– MySQL
– PARSEC
Execution Time by Region
(Chart. Y axis: Percentage of Total User Cycles, 0% to 100%. Series: Free, Lock, Lock Held, Unlock. X axis: Firefox (LiMiT), Apache (LiMiT), Parsec (LiMiT), MySQL (LiMiT), Apache (PAPI), Parsec (PAPI), MySQL (PAPI).)
Locking Statistics
          Avg. Lock Held Time (cycles)   Dynamic Locks per 10k Cycles   Static Locks
Firefox   789                            3.24                           57
Apache    149                            1.12                           1
PARSEC    118                            0.545                          17
MySQL     1076                           3.18                           13853
Observations & Implications
• Applications like Firefox and MySQL use
locks differently from Apache and PARSEC
– Many notions of synchronization based on
scientific computing probably don’t apply
• Locking overheads up to 8 - 13%
– More efficient mechanisms may be helpful
– But, 13% is upper bound on speedup
• MySQL has some very long critical sections
– Prime targets for micro-arch optimization
– If they run faster, MySQL scales better
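The upper-bound remark above is an Amdahl-style argument: even a locking mechanism with zero cost can only recover the time currently spent on locking. A quick calculation under that assumption (the function name is illustrative): removing an overhead that is fraction f of run time yields a speedup factor of at most 1 / (1 - f).

```python
def max_speedup(overhead_fraction: float) -> float:
    """Speedup factor if the given fraction of run time were removed entirely."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)


for fraction in (0.08, 0.13):
    print(f"{fraction:.0%} locking overhead: at most {max_speedup(fraction):.3f}x faster")
```

So a 13% locking overhead caps the saving at 13% of run time, which corresponds to a speedup factor of roughly 1.15x.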
Hardware Enhancements
• 64-bit Reads and Writes
– Overflows are primary source of complexity
– 64-bit counters w/ full read/write eliminates it
• Destructive Reads
– Difference = 2 reads, store, load & subtract
– Destructive read difference = 2 reads
• Combined Reads
– X86 counter read requires 2 instructions
– Combining should reduce overhead
• AMD’s Lightweight Profiling Proposal
– Really good, depending on microarchitecture
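The destructive-read proposal above can be illustrated with a toy counter. This is a software model only, with invented names: it contrasts the usual pattern (read, store, read again, load and subtract) with a read that also clears the counter, so that the second read is already the difference.

```python
class Counter:
    """Toy event counter with a normal and a destructive (read-and-reset) read."""

    def __init__(self):
        self.value = 0

    def count(self, events: int) -> None:
        self.value += events

    def read(self) -> int:
        return self.value

    def read_and_reset(self) -> int:
        value, self.value = self.value, 0
        return value


def region_events_plain(counter: Counter, work) -> int:
    before = counter.read()   # read 1, result stored
    work()
    after = counter.read()    # read 2
    return after - before     # load the stored value and subtract


def region_events_destructive(counter: Counter, work) -> int:
    counter.read_and_reset()         # read 1 clears the counter
    work()
    return counter.read_and_reset()  # read 2 is already the difference


c = Counter()
c.count(5)
print(region_events_plain(c, lambda: c.count(7)))        # 7
print(region_events_destructive(c, lambda: c.count(3)))  # 3
```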