Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn
University of Michigan, Ann Arbor
Deterministic Replay
• Record and reproduce non-deterministic events
1) Offline
Uses: replay repeatedly after original run
• Debugging
• Forensics
2) Online
Uses: record and replay concurrently
• Fault tolerance
• Decoupled runtime checks
We focus on online replay for multi-processors
Online Deterministic Replay Uses
Fault Tolerance
[Diagram: a server handles requests and responses while logging its execution; a replica replays the log to keep the same state and takes over when the server faults.]
Decoupled Runtime Checks
[Diagram: the application runs segments A, B, C on P1; P2, P3, and P4 replay A, B, and C respectively, each with a check.]
• Need to record and replay concurrently
• Both recording and replaying should be efficient
Past Solutions for Deterministic Replay
Uniprocessor Replay
• Program input (e.g. system calls, signals, etc.)
• Thread scheduling
Multiprocessor Replay: + Shared memory dependencies
• Instrument every memory operation
– PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] → 10-100x
• Page protection
– SMP-ReVirt [Dunlap, VEE’08] → 2-9x
• Offline search
– ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09] → Slow replay
• Hardware support
– FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] → Custom HW
Overview of Our Approach
Goal: Efficient online software-only multiprocessor replay
[Diagram: the recorded process (T1, T2) runs from checkpoint A to checkpoint B, performing Lock(l)/Unlock(l) operations. A multi-threaded fork creates the replayed process (T1’, T2’), which starts from A’, speculates that the execution is race free, performs the matching Lock(l’)/Unlock(l’) operations, and checks B’ == B.]
Key Idea: Speculation + Check
1) Speculate data race free
2) Detect mis-speculation using a cheap check
3) Rollback and retry on mis-speculation
Roadmap
• Motivation/Overview
• Respec Design
1. Speculate data race free
2. Detect mis-speculation
3. Rollback and retry on mis-speculation
• Evaluation
• Conclusion
Deterministic Replay of Data-race-free Programs
Observation
• Reproducing program input and happens-before order of sync. operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99]
1) Program input (e.g. system calls, signals, etc.)
• Record: Log system call effects + total order
• Replay: Emulate system call + total order
2) Synchronization Operations
• Record and replay happens-before order
• Instrument common (not all) synchronization primitives in glibc
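A minimal sketch of recording and replaying the order of synchronization operations. It is not Respec's glibc instrumentation: the `OrderedLock` wrapper, the thread names, and the use of a total acquisition order for a single lock (one simple way to preserve happens-before order) are all assumptions made for illustration.

```python
import threading

class OrderedLock:
    """Hypothetical lock wrapper: logs acquisition order, or enforces a logged one."""

    def __init__(self, replay_order=None):
        self._lock = threading.Lock()
        self._cond = threading.Condition()
        self.log = []  # thread names in the order they acquired the lock
        self._replay = list(replay_order) if replay_order is not None else None
        self._next = 0  # index of the next acquisition allowed during replay

    def acquire(self, name):
        if self._replay is not None:
            with self._cond:
                # Replay mode: wait until the log says it is this thread's turn.
                self._cond.wait_for(
                    lambda: self._next < len(self._replay)
                    and self._replay[self._next] == name
                )
        self._lock.acquire()
        self.log.append(name)  # appended while holding the lock

    def release(self):
        self._lock.release()
        if self._replay is not None:
            with self._cond:
                self._next += 1
                self._cond.notify_all()

def run_workers(lock, names, rounds=3):
    """Each named thread takes the lock `rounds` times."""
    def worker(name):
        for _ in range(rounds):
            lock.acquire(name)
            lock.release()
    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock.log
```

Running `run_workers(OrderedLock(), ["T1", "T2"])` records an order; passing that log as `replay_order` to a second `OrderedLock` forces the second run to acquire the lock in the same order.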
What if a program is NOT race free?
Problem
• Need to detect mis-speculation
• Data race detector is too heavy-weight
Insight: External Determinism is sufficient
• Not necessary to replay data races
• Ensure that the replayed process produces the same visible effects as the recorded process to an external observer
Visible effects = System output + Final program state
Solution: Divergence checks
• Detect mis-speculation when the replay is not externally deterministic
Divergence Check #1 – System Output
1) System Output Check
• For every system call, compare system call arguments
• Ensure that the replay produces the same output as the recorded process
[Diagram: the recorded process (T1, T2, start A) performs Lock(l)/Unlock(l) operations, SysRead X, and SysWrite O. The replayed process (T1’, T2’, start A’), created by a multi-threaded fork, performs Lock(l’)/Unlock(l’), SysRead X’, and SysWrite O’, then checks O’ == O.]
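The system output check can be sketched as a comparison of each replayed system call against the recorded log. This is an illustrative model only: the `SyscallRecord` and `OutputChecker` names, and the choice to match calls per thread, are assumptions and not Respec's kernel code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyscallRecord:
    thread: str
    name: str
    args: tuple

class Divergence(Exception):
    """Raised when the replayed process is not externally deterministic."""

class OutputChecker:
    def __init__(self, log):
        # Assumption: recorded calls are matched per thread, in program order.
        self._recorded = {}
        for rec in log:
            self._recorded.setdefault(rec.thread, []).append(rec)
        self._pos = {thread: 0 for thread in self._recorded}

    def check(self, thread, name, args):
        """Compare one replayed system call with the next recorded one."""
        queue = self._recorded.get(thread, [])
        i = self._pos.get(thread, 0)
        if i >= len(queue) or queue[i].name != name or queue[i].args != tuple(args):
            raise Divergence(f"{thread}: {name}{tuple(args)} does not match the log")
        self._pos[thread] = i + 1

log = [SyscallRecord("T2", "write", (1, b"x=1"))]
checker = OutputChecker(log)
checker.check("T2", "write", (1, b"x=1"))  # same arguments: check passes
```

A replayed `write` with different bytes (for example `b"x=2"`) would raise `Divergence`, which in the talk's terms is a detected mis-speculation.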
Benign Data Races
• Not all races cause divergence checks to fail
• A data race is inconsequential if system output matches
[Diagram: in both the recorded process (T1, T2, start A) and the replayed process (T1’, T2’, start A’, created by a multi-threaded fork), one thread writes x=1 while the other tests x!=0?. The number of tests differs between the two executions, but both issue the same SysWrite(x), so the check succeeds.]
Divergence due to Data Races
[Diagram: the recorded process (T1, T2, start A) and the replayed process (T1’, T2’, start A’, created by a multi-threaded fork) execute the racing writes x=1 and x=2 in opposite orders. Their SysWrite(x) outputs differ, so the check fails.]
1) Need to rollback to the beginning
2) Need to buffer system output till the end
Divergence Check #2 – Program State
2) Program state check
• Compare register and memory state at semi-regular intervals (epochs)
• Construct a safe intermediate point
– To release buffered output
– To rollback to in case of mis-speculation
[Diagram: the recorded process (T1, T2, start A) and the replayed process (T1’, T2’, start A’) reach checkpoint B. The check B’ == B? succeeds and buffered output is released. After B, the racing writes x=1 and x=2 occur in opposite orders and the SysWrite(x) check fails.]
Recovery from Mis-speculation
Rollback
• Rollback both recorded and replayed processes to the previous checkpoint
Re-execute
• Optimistically re-run the failed epoch
• On repeated failure, switch to uniprocessor execution model
– Record and replay only one thread at a time
– Parallel execution resumes after the failed interval
[Diagram: T1, T2 and T1’, T2’ around checkpoint B with the racing writes x=1 and x=2; the check B’ == B? fails.]
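The recovery policy above can be sketched as a small control loop. This is a simplified model under stated assumptions: `run_parallel` and `run_serial` are hypothetical callables that execute one epoch from a copy of the checkpoint and return the resulting state, and the single optimistic retry (`max_retries=1`) is an illustrative default, not a figure from the talk.

```python
def run_epoch_with_recovery(checkpoint, run_parallel, run_serial, max_retries=1):
    """Run one epoch; return (committed_state, mode).

    The first pass and each retry restart both executions from the checkpoint.
    """
    for _ in range(max_retries + 1):
        recorded = run_parallel(dict(checkpoint))  # recorded process
        replayed = run_parallel(dict(checkpoint))  # replayed process
        if recorded == replayed:
            return recorded, "parallel"  # states match: commit the epoch
        # Mis-speculation: discard both results and roll back to the checkpoint.
    # Repeated failure: fall back to one thread at a time for this epoch only.
    return run_serial(dict(checkpoint)), "uniprocessor"

def scripted(values):
    """Hypothetical epoch runner that yields a scripted outcome for x per run."""
    it = iter(values)
    def run(state):
        state["x"] = next(it)
        return state
    return run

# The race resolves differently in every parallel run, so the epoch is serialized.
state, mode = run_epoch_with_recovery(
    {"x": 0}, scripted([2, 1, 1, 2]), scripted([1])
)
```

Because only the failed epoch is serialized, the next epoch would start in parallel mode again, matching the "parallel execution resumes after the failed interval" bullet.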
Speculative Execution
Speculator [Nightingale et al. SOSP’05]
• Buffer output during speculation
• Block execution if speculative execution is not feasible
• Release buffered output on commit
• Undo speculative changes and squash buffered output on mis-speculation
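The buffer, commit, and squash behavior can be illustrated with a toy output buffer. This is not Speculator's interface: the `SpeculativeOutput` class and the list used as a stand-in for the external world are assumptions for illustration.

```python
class SpeculativeOutput:
    """Hypothetical output buffer: nothing is externally visible until commit."""

    def __init__(self, sink):
        self._sink = sink    # stand-in for the external observer
        self._buffer = []    # output produced during the current epoch

    def write(self, data):
        self._buffer.append(data)  # held back while the epoch is speculative

    def commit(self):
        self._sink.extend(self._buffer)  # epoch verified: release output
        self._buffer.clear()

    def squash(self):
        self._buffer.clear()  # mis-speculation: discard unreleased output

visible = []
out = SpeculativeOutput(visible)
out.write("x=2")
out.squash()   # check failed, so nothing was released
out.write("x=1")
out.commit()   # check passed, so the output becomes visible
```

After this sequence `visible` holds only the output of the committed epoch.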
Roadmap
• Motivation/Overview
• Respec Design
• Evaluation
1. Performance results
2. Breakdown of performance overhead
3. Rollback frequency and overhead
• Conclusion
Evaluation Setup
Test Environment
• 2 GHz 8 core Xeon processor with 3 GB of RAM
• Run 1~4 worker threads (excluding control threads)
• Collect the average of 10 trials (except pbzip2 and aget)
Benchmarks
• PARSEC suite
– blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
• SPLASH-2 suite
– ocean, raytrace, volrend, water-nsq, fft, and radix
• Real applications
– pbzip2, pfscan, aget, and Apache
Record and Replay Performance
[Chart: bars for 1 to 4 worker threads per benchmark (y-axis 0 to 2): blackscholes, bodytrack, fluidanimate, swaptions, streamcluster, ocean, raytrace, volrend, water-nsq, fft, radix, pfscan, pbzip2, aget, Apache]
• Overhead: 18% for 2 threads, 55% for 4 threads
• Real applications (including Apache) showed <50% for 4 threads
1) Redundant Execution Overhead (25%)
[Chart: the same per-benchmark bars, highlighting the redundant execution overhead component (25%)]
• Cost of running two executions (lower bound of online replay)
• Mainly due to sharing limited resources: memory system
• Contributes 25% of total cost for 4 threads
2) Epoch Overhead (17%)
[Chart: the same per-benchmark bars, with components redundant execution overhead (25%) and epoch overhead (17%)]
• Due to checkpoint cost
• Due to artificial epoch barrier cost
• Contributes 17% of total cost for 4 threads
3) Memory Comparison Overhead (16%)
[Chart: the same per-benchmark bars, with components redundant execution overhead (25%), epoch overhead (17%), and memory comparison overhead (16%)]
• Optimization 1: compare dirty pages only
• Optimization 2: parallelize comparison
• Contributes 16% of total cost for 4 threads
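The two optimizations can be sketched together. This is an assumption-heavy model: memory is represented as a dictionary from page number to bytes, the dirty sets are given as inputs, and a thread pool stands in for whatever parallel mechanism the kernel implementation uses.

```python
from concurrent.futures import ThreadPoolExecutor

def diverged_pages(rec_mem, rep_mem, rec_dirty, rep_dirty, workers=4):
    """Return page numbers whose contents differ between the two processes.

    Only pages written during the epoch by either process are compared;
    pages untouched since the common checkpoint are assumed identical.
    """
    candidates = sorted(set(rec_dirty) | set(rep_dirty))

    def differs(page):
        return rec_mem.get(page) != rep_mem.get(page)

    # Compare the candidate pages in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(differs, candidates))
    return [page for page, bad in zip(candidates, flags) if bad]

recorded = {0: b"code", 1: b"x=1", 2: b"heap"}
replayed = {0: b"code", 1: b"x=2", 2: b"heap"}
mismatch = diverged_pages(recorded, replayed, rec_dirty={1}, rep_dirty={1, 2})
```

Here page 0 is never compared because neither process dirtied it, and only page 1 is reported as diverged.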
4) Logging Overhead (42%)
[Chart: the same per-benchmark bars, with components redundant execution overhead (25%), epoch overhead (17%), memory comparison overhead (16%), and logging and other overhead (42%)]
• Overhead of logging synchronization operations and system calls
• Main cost for applications with fine-grained synchronization
• Contributes 42% of total cost for 4 threads
Rollback Frequency and Overhead
App.               Threads  Rollback frequency  Overhead  Avg. overhead
Pbzip2 (100 runs)  4        84% none            41%       45%
                            15% once            66%
                            1% twice            105%
Aget (50 runs)     4        80% none            6%        6%
                            18% once            6%
                            2% twice            6%
• Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
• Pbzip2: Rollbacks contribute <10% of total overhead
• Aget: Rollback overhead is negligible
• Frequent checkpoints => short epochs => small amount of work to be re-done
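As a worked check of the table, the average overhead is consistent with a frequency-weighted mean of the per-case overheads. Treating the average this way is an assumption about how the figure was computed; the input numbers are the ones in the table.

```python
def weighted_overhead(cases):
    """cases: (fraction_of_runs, overhead_percent) pairs; returns the weighted mean."""
    total = sum(frac for frac, _ in cases)
    return sum(frac * overhead for frac, overhead in cases) / total

pbzip2 = [(0.84, 41), (0.15, 66), (0.01, 105)]  # none, once, twice
aget = [(0.80, 6), (0.18, 6), (0.02, 6)]

pbzip2_avg = weighted_overhead(pbzip2)           # about 45.4
aget_avg = weighted_overhead(aget)               # about 6.0
pbzip2_rollback = weighted_overhead(pbzip2[1:])  # runs with rollbacks only, about 68.4
```

For Pbzip2 the weighted mean is 0.84 × 41 + 0.15 × 66 + 0.01 × 105 ≈ 45.4, which rounds to the reported 45%.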
Conclusion
Goal: Deterministic replay for multithreaded programs
• Software-only: no custom hardware
• Online: record and replay concurrently
Contributions to replay
• Speculation: speculate race-free, and rollback/retry if needed
• External Determinism: match system output and program states
Results
• Performance overhead to record and replay concurrently
– 2 threads: 18%
– 4 threads: 55%
Thank you
Benign Data Races
Benign data races could cause frequent rollbacks
• Performance (NOT correctness) issue
• The latest Java and C++ memory models prohibit benign races => There are only harmful races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
• Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords
• Could automatically detect and instrument
Implementation
Modify Linux 2.6.27 kernel
• Deterministic replay
– Multithreaded fork
– Record/replay program input (e.g. system calls, signals, …)
– Compare program state (memory and register contents)
• Speculator [Nightingale et al. SOSP’05]
– Checkpoint and rollback
– Buffer system output or propagate speculative states
Modify glibc 2.5.1
• Support recording/replaying low-level synchronization operations
• e.g. locks, unlocks, futex waits, futex wakes
Handling System Calls
Replayed process
1) Emulate most system calls
• Feed logged return value and data copied into the process
2) Re-execute some system calls
• Create or delete threads: clone, exit, …
• Modify address space: mmap2, mprotect, …
Problem
• Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
• Process can NOT transition from replaying to live execution
Solution
• Recreate the OS state by re-executing native/virtualized system calls
– ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]
Multi-threaded Fork (Checkpoint)
Copy-on-write fork
• Linux’s fork supports forking only a single thread
• Need a new copy-on-write primitive for checkpointing multiple threads
• Should checkpoint a thread at a safe point
– kernel entry/exit (system call)
Multi-threaded fork
1) The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution
Semi-regular checkpoints
• Adaptive epoch length
– To bound the amount of work that must be redone on rollback
• Output-triggered commit
– To provide acceptable latency for interactive tasks
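The two-step barrier protocol for the multi-threaded fork can be sketched in user space. This is only an analogy: a deep copy of a dictionary stands in for the kernel's copy-on-write checkpoint, and the `arrive`/`resume` barriers and the counter workload are hypothetical names chosen for illustration.

```python
import copy
import threading

def run_with_checkpoint(num_workers=3):
    state = {"counter": 0}
    state_lock = threading.Lock()
    arrive = threading.Barrier(num_workers + 1)  # workers plus the initiator
    resume = threading.Barrier(num_workers + 1)
    checkpoint = {}

    def worker():
        with state_lock:
            state["counter"] += 1    # work done before the safe point
        arrive.wait()                # safe point: stop and wait for the checkpoint
        resume.wait()                # released once the checkpoint exists
        with state_lock:
            state["counter"] += 10   # work done after the checkpoint

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    arrive.wait()                              # step 1: all threads at a safe point
    checkpoint.update(copy.deepcopy(state))    # step 2: take the checkpoint
    resume.wait()                              # then let the other threads continue
    for t in threads:
        t.join()
    return checkpoint, state

checkpoint, final_state = run_with_checkpoint()
```

The checkpoint captures only the pre-barrier work of every thread, because no worker can proceed past `resume` until the initiator has finished copying.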
Benefits of Program State Check
1) Allow Respec to commit epochs and release system output
• Buffer output during speculation
• Safe to release output on commit after matching program state
2) Reduce the amount of execution that must be re-done when a check fails
3) Allow broader uses of replay system
• Tolerating non-fail-stop faults (e.g. transient hardware fault)
– Need to detect latent faults
• Parallelizing security and reliability checks
Offline Replay with Respec
Respec Log
• Kernel’s system calls + user-level synchronizations
• MD5 checksum of address space and register state
Problem: Not all races are logged
• Offline replay is NOT guaranteed to succeed
• Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed
Solution
• Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]
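A minimal sketch of how a per-epoch checksum in the log could be used during offline replay. The serialization (register names and page numbers sorted, 8-byte little-endian integers) and the `state_checksum` helper are assumptions; the slides only state that an MD5 checksum of address space and register state is logged.

```python
import hashlib

def state_checksum(registers, pages):
    """MD5 over register values and page contents, in a fixed order."""
    h = hashlib.md5()
    for name in sorted(registers):
        h.update(name.encode())
        h.update(registers[name].to_bytes(8, "little"))
    for page_no in sorted(pages):
        h.update(page_no.to_bytes(8, "little"))
        h.update(pages[page_no])
    return h.hexdigest()

# Checksum stored in the log at the end of an epoch during recording.
logged = state_checksum({"eip": 0x8048000, "eax": 1}, {0: b"x=1"})

# An offline replay attempt matches the epoch only if it reproduces the checksum.
attempt = state_checksum({"eip": 0x8048000, "eax": 1}, {0: b"x=2"})
matches = attempt == logged
```

In this example `matches` is false, so an offline search tool would need to try a different resolution of the unlogged race for that epoch.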
Non-Deterministic Program Input
• e.g. I/O, DMA, interrupts, signals, RDTSC, context-switch, page fault
• Asynchronous interrupts (caused by external sources)
– e.g. I/O, timer, disk read completion
• Synchronous interrupts (= traps)
– e.g. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss
• x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode)
– e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)
Rollback Frequency and Overhead (Pbzip2)
Threads  Rollback frequency                   Original time (sec)  Type          Respec time (sec)  Slowdown
1        0%                                   4.59                 overall       4.83               5%
2        13% once                             2.35                 w/o rollback  2.70               15%
                                                                   w/ rollback   2.97               26%
                                                                   overall       2.73               16%
3        9% once, 2% twice                    1.64                 w/o rollback  2.00               22%
                                                                   w/ rollback   2.29               40%
                                                                   overall       2.03               24%
4        84% no rollback, 15% once, 1% twice  1.33                 w/o rollback  1.88               41%
                                                                   w/ rollback   2.24               68%
                                                                   overall       1.93               45%
• Out of 100 runs, 13-16% of executions invoke one or more rollbacks
• Rollbacks contribute 8% of Respec's total overhead
Rollback Frequency and Overhead (Aget)
Threads  Rollback frequency  Original time (sec)  Type          Respec time (sec)  Slowdown
1        10% once, 2% twice  2.05                 w/o rollback  2.19               7%
                                                  w/ rollback   2.21               8%
                                                  overall       2.19               7%
2        20% once, 2% twice  1.93                 w/o rollback  2.17               13%
                                                  w/ rollback   2.17               13%
                                                  overall       2.17               13%
3        24% once            1.94                 w/o rollback  2.08               7%
                                                  w/ rollback   2.09               8%
                                                  overall       2.08               7%
4        18% once, 2% twice  1.96                 w/o rollback  2.07               6%
                                                  w/ rollback   2.08               6%
                                                  overall       2.08               6%
• Out of 50 runs, 14-24% of executions invoke one or more rollbacks
• Performance impact is negligible (due to very frequent checkpoints)