Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn

University of Michigan, Ann Arbor

Deterministic Replay

• Record and reproduce non-deterministic events

1) Offline
Uses: replay repeatedly after original run
• Debugging
• Forensics

2) Online
Uses: record and replay concurrently
• Fault tolerance
• Decoupled runtime checks

We focus on online replay for multi-processors

Dongyoon Lee 2

Online Deterministic Replay Uses

[Figure: two panels. Fault Tolerance: log, response, fault, replay keeps the same state, takeover. Decoupled Runtime Checks: App on P1; Replay + Check on P2, P3, P4 (A + Check, B + Check, C + Check).]

• Need to record and replay concurrently
• Both recording and replaying should be efficient

Past Solutions for Deterministic Replay

Uniprocessor Replay
• Program input (e.g. system calls, signals, etc.)
• Thread scheduling

Multiprocessor Replay: + Shared memory dependencies
• Instrument every memory operation: PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] → 10-100x
• Page protection: SMP-ReVirt [Dunlap, VEE’08] → 2-9x
• Offline search: ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09] → Slow replay
• Hardware support: FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] → Custom HW

Overview of Our Approach

Goal: Efficient software-only multiprocessor replay

[Figure: the recorded process (T1; Checkpoint A, Lock(l), Unlock(l), Lock(l), Checkpoint B) and, after a multi-threaded fork, the replayed process (A’, Lock(l’), Unlock(l’), Lock(l’)), which speculates that the execution is race free and checks B’ == B]

Key Idea: Speculation + Check
1) Speculate data race free
2) Detect mis-speculation using a cheap check
3) Rollback and retry on mis-speculation


• Motivation/Overview
• Respec Design
  1. Speculate data race free
  2. Detect mis-speculation
  3. Rollback and retry on mis-speculation
• Evaluation
• Conclusion

Deterministic Replay of Data-race-free Programs


• Reproducing program input and happens-before order of sync. operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99]

1) Program input (e.g. system calls, signals, etc.)
• Record: Log system call effects + total order
• Replay: Emulate system call + total order

2) Synchronization Operations
• Record and replay happens-before order
• Instrument common (not all) synchronization primitives in glibc
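A minimal sketch of recording and replaying the order of synchronization operations, written in Python for illustration only. Respec instruments glibc primitives; the classes `RecordingLock` and `ReplayLock` and the helper `run` are hypothetical names, and the sketch enforces a per-lock total order, which is a simple (stronger than necessary) way to reproduce the happens-before order.

```python
import threading


class RecordingLock:
    """Lock wrapper that logs the order in which logical thread ids acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []  # acquisition order, e.g. [0, 2, 1, ...]

    def acquire(self, tid):
        self._lock.acquire()
        self.log.append(tid)  # appended while holding the lock, so the log is ordered

    def release(self):
        self._lock.release()


class ReplayLock:
    """Lock wrapper that grants the lock only in the previously logged order."""

    def __init__(self, log):
        self._log = list(log)
        self._next = 0
        self._held = False
        self._cond = threading.Condition()

    def acquire(self, tid):
        with self._cond:
            # Wait until the lock is free and the log says it is this thread's turn.
            self._cond.wait_for(
                lambda: not self._held
                and self._next < len(self._log)
                and self._log[self._next] == tid
            )
            self._held = True
            self._next += 1

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()


def run(lock, n_threads=3, iters=5):
    """Each thread appends its id to a shared list while holding the lock."""
    shared = []

    def worker(tid):
        for _ in range(iters):
            lock.acquire(tid)
            shared.append(tid)
            lock.release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Running `run(RecordingLock())` and then `run(ReplayLock(log))` with the recorded log yields the same interleaving of critical sections in both runs, as long as the program is race free outside the lock.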

What if a program is not race free?

• Need to detect mis-speculation
• Data race detector is too heavy-weight

External Determinism is sufficient
• Not necessary to replay data races
• Ensure that the replayed process produces the same visible effects as the recorded process to an external observer

Visible effects = System output + Final program state

Solution: Divergence checks
• Detect mis-speculation when the replay is not externally deterministic

Divergence Check #1 – System Output

1) System Output Check
• For every system call, compare system call arguments
• Ensure that the replay produces the same output as the recorded process

[Figure: the recorded process (T1; Start A, Lock(l), Unlock(l), Lock(l), SysRead X, SysWrite O) and, after a multi-threaded fork, the replayed process (T1’, T2’; Start A’, Lock(l’), Unlock(l’), Lock(l’), SysRead X’, SysWrite O’), with the check O’ == O]
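The system output check can be sketched as follows. This is an illustrative Python model, not Respec's kernel code: `SyscallLog`, `ReplayChecker`, and `Divergence` are hypothetical names, and system calls are modeled as `(name, args)` tuples with a logged result.

```python
class Divergence(Exception):
    """Raised when the replayed process is not externally deterministic."""


class SyscallLog:
    """Log written by the recorded process: one entry per system call."""

    def __init__(self):
        self.entries = []  # list of (name, args, result)

    def record(self, name, args, result):
        self.entries.append((name, args, result))


class ReplayChecker:
    """Compares each replayed system call against the log, in order."""

    def __init__(self, log):
        self._entries = list(log.entries)
        self._pos = 0

    def syscall(self, name, args):
        if self._pos >= len(self._entries):
            raise Divergence("replay issued an extra system call")
        rec_name, rec_args, rec_result = self._entries[self._pos]
        if (name, args) != (rec_name, rec_args):
            raise Divergence(
                f"expected {rec_name}{rec_args}, replay issued {name}{args}"
            )
        self._pos += 1
        # The call is emulated: the replay receives the logged result.
        return rec_result
```

A replayed `write` whose buffer differs from the recorded one raises `Divergence`, which corresponds to a failed O’ == O check.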


Benign Data Races

• Not all races cause divergence checks to fail
• A data race is benign if system output matches

[Figure: the recorded process (T1; Start A) and, after a multi-threaded fork, the replayed process (Start A’) both contain the racing accesses x=1 and x!=0?; SysWrite(x) matches, so the check succeeds]

Divergence due to Data Races


[Figure: the recorded process (Start A; T2) and, after a multi-threaded fork, the replayed process (T1’; Start A’) each execute the racing writes x=2 and x=1; the SysWrite(x) outputs are compared and the check fails]

1) Need to rollback to the beginning
2) Need to buffer system output till the end

Divergence Check #2 – Program State

2) Program state check
• Compare register and memory state at semi-regular intervals (epochs)
• Construct a safe intermediate point
  – To release buffered output
  – To rollback to in case of mis-speculation

[Figure: the recorded process (T1, T2; Start A, Checkpoint B, x=2, x=1, SysWrite(x)) and the replayed process (T1’; Start A’, x=1, x=2, SysWrite(x)). The check B’ == B succeeds and output is released; the later SysWrite(x) check fails.]
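A minimal sketch of the program state check at an epoch boundary, assuming registers are modeled as a dict of unsigned 64-bit integers and memory as a dict from page number to bytes. The function names `state_digest` and `epoch_matches` are hypothetical; MD5 is used here only because it makes the comparison compact, not as a claim about how the online check is implemented.

```python
import hashlib


def state_digest(registers, memory):
    """Digest of register and memory state in a canonical (sorted) order."""
    h = hashlib.md5()
    for name in sorted(registers):
        h.update(name.encode() + b"=")
        h.update(registers[name].to_bytes(8, "little"))  # assumes 0 <= value < 2**64
    for page in sorted(memory):
        h.update(page.to_bytes(8, "little"))
        h.update(memory[page])
    return h.hexdigest()


def epoch_matches(recorded, replayed):
    """True if the (registers, memory) pairs of both processes agree."""
    return state_digest(*recorded) == state_digest(*replayed)
```

If `epoch_matches` returns True the epoch can be committed and buffered output released; otherwise both processes roll back to the previous checkpoint.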


Recovery from Mis-speculation


• Rollback both recorded and replayed processes to the previous checkpoint
• Optimistically re-run the failed epoch
• On repeated failure, switch to uniprocessor execution model
  – Record and replay only one thread at a time
  – Parallel execution resumes after the failed interval

[Figure: T1, T2, T1’, T2’ start from Checkpoint B and execute the racing writes x=1 and x=2; the check B’ == B fails]
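The recovery policy above can be sketched as a small driver loop. This is a simplified model under stated assumptions: `run_parallel` and `run_serial` are hypothetical callables standing in for speculative multiprocessor execution and for one-thread-at-a-time execution, and program state is any value that supports `==` and deep copy.

```python
import copy


def run_epoch(checkpoint, run_parallel, run_serial, max_retries=1):
    """Run one epoch speculatively; roll back and retry on divergence.

    checkpoint   -- state at the start of the epoch (never mutated here)
    run_parallel -- hypothetical: state -> (recorded_state, replayed_state)
    run_serial   -- hypothetical: state -> state, executing one thread at a time
    Returns (committed_state, mode, parallel_attempts).
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        # Both processes start from a private copy of the checkpoint.
        recorded, replayed = run_parallel(copy.deepcopy(checkpoint))
        if recorded == replayed:
            return recorded, "parallel", attempts  # check passed: commit
        # Mis-speculation: discard both results; the next attempt restarts
        # from the checkpoint, which is the rollback.
    # Repeated failure: fall back to the uniprocessor execution model.
    return run_serial(copy.deepcopy(checkpoint)), "serial", attempts
```

After a serial epoch commits, the caller would take a new checkpoint and resume calling `run_epoch` for the following epochs, so parallel execution resumes after the failed interval.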

Speculative Execution

Speculator [Nightingale et al. SOSP’05]

• Buffer output during speculation
• Block execution if speculative execution is not feasible
• Release buffered output on commit
• Undo speculative changes and squash buffered output on mis-speculation
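A minimal sketch of output buffering during speculation, in the spirit of the bullets above. `SpeculativeOutput` is a hypothetical class, not Speculator's interface; the `sink` callable stands in for whatever makes output externally visible.

```python
class SpeculativeOutput:
    """Holds output produced during a speculative epoch until its fate is known."""

    def __init__(self, sink):
        self._sink = sink      # callable that makes output externally visible
        self._pending = []     # output buffered during the current epoch

    def write(self, data):
        # Nothing becomes visible while the epoch is still speculative.
        self._pending.append(data)

    def commit(self):
        # The epoch's checks passed: release the buffered output in order.
        for data in self._pending:
            self._sink(data)
        self._pending.clear()

    def rollback(self):
        # Mis-speculation: squash output that was never released.
        self._pending.clear()
```

For example, with `out = []` and `buf = SpeculativeOutput(out.append)`, writes appear in `out` only after `buf.commit()`, and writes followed by `buf.rollback()` never appear.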


• Motivation/Overview
• Respec Design
• Evaluation
  1. Performance results
  2. Breakdown of performance overhead
  3. Rollback frequency and overhead
• Conclusion

Evaluation Setup

Test Environment

• 2 GHz 8 core Xeon processor with 3 GB of RAM
• Run 1~4 worker threads (excluding control threads)
• Collect the average of 10 trials (except pbzip2 and aget)

• PARSEC suite – blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
• SPLASH-2 suite – ocean, raytrace, volrend, water-nsq, fft, and radix
• Real applications – pbzip2, pfscan, aget, and Apache

Record and Replay Performance

[Figure: bar chart with a y-axis from 0 to 2, showing 1 to 4 worker threads for each benchmark: blackscholes, bodytrack, fluidanimate, swaptions, streamcluster, ocean, raytrace, volrend, water-nsq, fft, radix, pfscan, pbzip2, aget, Apache]

• … for 2 threads, 55% for 4 threads
• Real applications (including …) showed <50% for 4 threads

1) Redundant Execution Overhead (25%)

[Figure: bar chart with a y-axis from 0 to 2, showing 1 to 4 worker threads for each PARSEC, SPLASH-2, and real-application benchmark; legend: Redundant execution overhead (25%)]

• Cost of running two executions (lower bound of online replay)
• Mainly due to sharing limited resources: memory system
• Contributes 25% of total cost for 4 threads

2) Epoch overhead (17%)

[Figure: bar chart with a y-axis from 0 to 2, showing 1 to 4 worker threads for each PARSEC, SPLASH-2, and real-application benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%)]

• Due to checkpoint cost
• Due to artificial epoch barrier cost
• Contributes 17% of total cost for 4 threads

3) Memory Comparison Overhead (16%)

[Figure: bar chart with a y-axis from 0 to 2, showing 1 to 4 worker threads for each PARSEC, SPLASH-2, and real-application benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%), Memory comparison overhead (16%)]

• Optimization 1: compare dirty pages only
• Optimization 2: parallelize comparison
• Contributes 16% of total cost for 4 threads
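The two optimizations can be sketched together as below. This is an illustrative user-level model only: memory is a flat bytes-like object, `dirty_pages` is assumed to be the set of page numbers written by either process during the epoch, and `differing_pages`, `PAGE_SIZE`, and the worker count are hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumed page size for this sketch


def differing_pages(recorded_mem, replayed_mem, dirty_pages, workers=4):
    """Return the dirty page numbers whose contents differ between the two processes."""

    def differs(page):
        lo = page * PAGE_SIZE
        hi = lo + PAGE_SIZE
        return recorded_mem[lo:hi] != replayed_mem[lo:hi]

    # Optimization 1: only pages dirtied during the epoch are compared.
    pages = sorted(set(dirty_pages))
    # Optimization 2: the per-page comparisons are spread over a worker pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(differs, pages))
    return [page for page, flag in zip(pages, flags) if flag]
```

An empty result means the memory part of the program state check passes. Pages outside `dirty_pages` are never read, which is where the savings come from.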

4) Logging Overhead (42%)

[Figure: bar chart with a y-axis from 0 to 2, showing 1 to 4 worker threads for each PARSEC, SPLASH-2, and real-application benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%), Memory comparison overhead (16%), Logging and other overhead (42%)]

• Overhead of logging synchronization operations and system calls
• Main cost for applications with fine-grained synchronization
• Contributes 42% of total cost for 4 threads

Rollback Frequency and Overhead



                     Pbzip2 (100 runs)                          Aget (50 runs)
Threads              4                                          4
Rollback frequency   84% none, 15% once, 1% twice               80% none, 18% once, 2% twice
Overhead             41% (none), 66% (once), 105% (twice)       6% (none), 6% (once), 6% (twice)
Avg. overhead        45%                                        6%

• Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
• Pbzip2: Rollbacks contribute 8% of total overhead
• Aget: Rollback overhead is negligible
• Frequent checkpoints => short epochs => small amount of work to be re-done
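As a quick consistency check on the 4-thread numbers, the average overhead can be reproduced by weighting the overhead of each case (no rollback, one rollback, two rollbacks) by its frequency. That the averages were computed this way is an assumption; the helper name `weighted_overhead` is hypothetical.

```python
def weighted_overhead(frequencies, overheads):
    """Average overhead (%), weighting each case by how often it occurred."""
    return sum(f * o for f, o in zip(frequencies, overheads))


# Pbzip2, 4 threads: 84% none, 15% once, 1% twice
pbzip2 = weighted_overhead([0.84, 0.15, 0.01], [41, 66, 105])
# Aget, 4 threads: 80% none, 18% once, 2% twice
aget = weighted_overhead([0.80, 0.18, 0.02], [6, 6, 6])

print(round(pbzip2), round(aget))  # prints: 45 6
```

Both results match the reported average overheads of 45% and 6%.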


Conclusion

Goal: Deterministic replay for multithreaded programs
• Software-only: no custom hardware
• Online: record and replay concurrently

Contributions to replay
• Speculation: speculate race-free, and rollback/retry if needed
• External Determinism: Match system output and program states

Performance overhead (record and replay concurrently)
• 2 threads: …
• 4 threads: 55%

Thank you…


Benign Data Races

Benign data races could cause frequent rollbacks
• Performance (not correctness) issue
• The latest Java and C++ memory model … benign races => There are only … races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
• Programmers should explicitly … intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords
• Could automatically detect and instrument


Modify Linux 2.6.27 kernel
• Deterministic replay
  – Multithreaded fork
  – Record/replay program input (e.g. system calls, signals, …)
  – Compare program state (memory and register contents)
• Speculator [Nightingale et al. SOSP’05]
  – Checkpoint and rollback
  – Buffer system output or propagate speculative states

Modify glibc 2.5.1
• Support recording/replaying low-level synchronization operations
  – e.g. lock, unlock, futex wait, futex wake

Handling System Calls

Replayed process

1) Most system calls
   – Feed logged return value and data copied into the process
2) Some system calls
   – Create or delete threads: clone, exit, …
   – Modify address space: mmap2, mprotect, …

• Does not recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
• Process cannot transition from replaying to live execution

• Recreate the OS state by … native/virtualized system calls: ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]

Multi-threaded Fork (Checkpoint)

Copy-on-write fork

• Linux’s fork duplicates only the calling thread
• Need a new copy-on-write primitive for checkpointing multiple threads
• Should checkpoint a thread at a safe point
  – kernel entry/exit (system call)

Multi-threaded fork

1) The thread that initiates a multithreaded fork creates a barrier, on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution
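The barrier protocol can be sketched at user level with Python's `threading.Barrier`. This is only an analogy: in Respec the safe points are kernel entry/exit and the checkpoint is a copy-on-write fork, whereas here every worker calls `barrier.wait()` explicitly and the "checkpoint" is a deep copy of a shared dict. The function `checkpointed_run` and its parameters are hypothetical.

```python
import copy
import threading


def checkpointed_run(n_threads=4, steps_before=3, steps_after=2):
    """Run workers that pause at a barrier while one thread takes a checkpoint."""
    state = {"counters": [0] * n_threads}
    checkpoints = []

    def take_checkpoint():
        # Runs in exactly one thread, after all threads have arrived and
        # before any of them is released, so the state is quiescent.
        checkpoints.append(copy.deepcopy(state))

    barrier = threading.Barrier(n_threads, action=take_checkpoint)

    def worker(tid):
        for _ in range(steps_before):
            state["counters"][tid] += 1
        barrier.wait()  # safe point: park until the checkpoint is taken
        for _ in range(steps_after):
            state["counters"][tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints, state
```

With the default arguments the single checkpoint holds `[3, 3, 3, 3]` and the final state holds `[5, 5, 5, 5]`, showing that no thread ran past its safe point before the checkpoint was captured.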

Semi-regular checkpoints

• Adaptive epoch length
  – To bound the amount of work that must be redone on rollback
• Output triggered commit
  – To provide acceptable latency for interactive tasks

Benefits of Program State Check

1) Allow Respec to commit epochs and release system output
   • Buffer output during speculation
   • Safe to release output on commit after matching program state
2) Reduce the amount of execution that must be re-done when a check fails
3) Allow broader uses of replay system
   • Tolerating non-fail-stop faults (e.g. transient hardware fault)
   • Need to detect latent faults
   • Parallelizing security and reliability checks

Offline Replay with Respec

Respec Log

• Kernel’s system call + User-level synchronizations
• MD5 checksum of address space and register state

Problem: Not all races are logged
• Offline replay is NOT guaranteed to succeed
• Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed

• Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]

Non-Deterministic Program Input

• e.g. I/O, DMA, interrupts, signals, RDTSC, context-switch, page fault
• Asynchronous interrupts (caused by external sources)
  – e.g. I/O, timer, disk read completion
• Synchronous interrupts (= traps)
  – e.g. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss
• x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode)
  – e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)

Rollback Frequency and Overhead (Pbzip2)


Threads  Rollback Frequency                    Type          Overhead
1        0%                                    overall       5%
2        13% once                              w/o rollback  15%
                                               w/ rollback   26%
                                               overall       16%
3        9% once, 2% twice                     w/o rollback  22%
                                               w/ rollback   40%
                                               overall       24%
4        84% no rollback, 15% once, 1% twice   w/o rollback  41%
                                               w/ rollback   68%
                                               overall       45%

Original Time (sec): …
Respec Time (sec): …

• Out of 100 runs, 13-16% of executions invoke one or more rollbacks
• Rollbacks contribute 8% of Respec's total overhead

Rollback Frequency and Overhead (Aget)

Threads  Rollback Frequency    Type          Overhead
1        10% once, 2% twice    w/o rollback  7%
                               w/ rollback   8%
                               overall       7%
2        20% once, 2% twice    w/o rollback  13%
                               w/ rollback   13%
                               overall       13%
3        24% once              w/o rollback  7%
                               w/ rollback   8%
                               overall       7%
4        18% once, 2% twice    w/o rollback  6%
                               w/ rollback   6%
                               overall       6%

Original Time (sec): 2.05
Respec Time (sec): …

• Out of 50 runs, 14-24% of executions invoke one or more rollbacks
• Performance impact is negligible (due to very frequent checkpoints)