
Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn

University of Michigan, Ann Arbor

Deterministic Replay

• Record and reproduce non-deterministic events

1) Offline

Uses: replay repeatedly after the original run
• Debugging
• Forensics

2) Online

Uses: record and replay concurrently
• Fault tolerance
• Decoupled runtime checks

We focus on online replay for multiprocessors

Dongyoon Lee 2

Online Deterministic Replay Uses

[Figure, left panel "Fault Tolerance": a server receives a request, logs it, and sends a response; a replica replays the log to keep the same state and takes over on a fault.]

[Figure, right panel "Decoupled Runtime Checks": the application runs A, B, C on P1 while replay runs "A + Check", "B + Check", "C + Check" on P2, P3, P4.]

• Need to record and replay concurrently
• Both recording and replaying should be efficient

Past Solutions for Deterministic Replay

Uniprocessor Replay

• Program input (e.g. system calls, signals, etc.)
• Thread scheduling

Multiprocessor Replay: + Shared memory dependencies

• Instrument every memory operation: PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] → 10-100x
• Page protection: SMP-ReVirt [Dunlap, VEE’08] → 2-9x
• Offline search: ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09] → Slow replay
• Hardware support: FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] → Custom HW

Overview of Our Approach

Goal: Efficient online software-only multiprocessor replay

[Figure: the recorded process and the replayed process (threads T1’, T2’) both start from checkpoint A (A’) created by a multi-threaded fork. Both execute Lock(l)/Unlock(l) in the same order while speculating that the program is race free. At checkpoint B the states are compared: B’ == B?]

Key Idea: Speculation + Check

1) Speculate data race free
2) Detect mis-speculation using a cheap check
3) Rollback and retry on mis-speculation

Roadmap

• Motivation/Overview
• Respec Design
  1. Speculate data race free
  2. Detect mis-speculation
  3. Rollback and retry on mis-speculation
• Evaluation
• Conclusion

Deterministic Replay of Data-race-free Programs

Observation

• Reproducing program input and happens-before order of sync. operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99]

1) Program input (e.g. system calls, signals, etc.)
• Record: Log system call effects + total order
• Replay: Emulate system call + total order

2) Synchronization Operations
• Record and replay happens-before order
• Instrument common (not all) synchronization primitives in glibc
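The idea of recording and replaying the order of synchronization operations can be sketched in a few lines. This is a minimal illustration, not Respec's glibc instrumentation: the class names `RecordingLock` and `ReplayLock` and the helper `run` are hypothetical, and the sketch logs a total acquisition order for one lock, which is one simple way to capture the happens-before order on that lock.

```python
import threading

class RecordingLock:
    """Wraps a lock and logs the order in which threads acquire it."""
    def __init__(self):
        self._lock = threading.Lock()
        self.log = []  # thread names in acquisition order

    def acquire(self, name):
        self._lock.acquire()
        self.log.append(name)  # appended while holding the lock

    def release(self):
        self._lock.release()

class ReplayLock:
    """Grants the lock only in the order given by a recorded log."""
    def __init__(self, log):
        self._order = list(log)
        self._next = 0  # index of the next acquisition to grant
        self._cond = threading.Condition()

    def acquire(self, name):
        with self._cond:
            # Block until the recorded order says it is this thread's turn.
            self._cond.wait_for(
                lambda: self._next < len(self._order)
                and self._order[self._next] == name
            )

    def release(self):
        with self._cond:
            self._next += 1  # hand the turn to the next logged thread
            self._cond.notify_all()

def run(lock, names, rounds=3):
    """Each thread appends its name to a shared list while holding the lock."""
    shared = []

    def worker(name):
        for _ in range(rounds):
            lock.acquire(name)
            shared.append(name)
            lock.release()

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

recorder = RecordingLock()
recorded = run(recorder, ["T1", "T2"])
replayed = run(ReplayLock(recorder.log), ["T1", "T2"])
```

Because the replayed run is forced through the logged order, `replayed` equals `recorded` whatever interleaving the recording happened to take.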

What if a program is NOT race free?

Problem

• Need to detect mis-speculation
• Data race detector is too heavy-weight

Insight: External Determinism is sufficient

• Not necessary to replay data races
• Ensure that the replayed process produces the same visible effects as the recorded process to an external observer

Visible effects = System output + Final program state

Solution: Divergence checks

• Detect mis-speculation when the replay is not externally deterministic

Divergence Check #1 – System Output

1) System Output Check

• For every system call, compare system call arguments
• Ensure that the replay produces the same output as the recorded process

[Figure: the recorded process and the replayed process (T1’, T2’) start from A and A’ after a multi-threaded fork and execute Lock(l)/Unlock(l) in the same order. The recorded process issues SysRead X and SysWrite O; the replayed process issues SysRead X’ and SysWrite O’, and the output is checked: O’ == O?]
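The system output check amounts to comparing each replayed system call against the logged one. The following is a minimal sketch under stated assumptions: the `Syscall` record and the `check_output` function are hypothetical names, and system calls are modelled as a name plus a tuple of arguments rather than real kernel state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Syscall:
    name: str
    args: tuple

def check_output(recorded_log, replayed_calls):
    """Return the index of the first divergent system call, or None if all match."""
    for i, (rec, rep) in enumerate(zip(recorded_log, replayed_calls)):
        if rec != rep:
            return i
    if len(recorded_log) != len(replayed_calls):
        # One process issued extra calls: divergence starts where the shorter log ends.
        return min(len(recorded_log), len(replayed_calls))
    return None

recorded = [Syscall("read", (3, 16)), Syscall("write", (1, b"x=1"))]
same = [Syscall("read", (3, 16)), Syscall("write", (1, b"x=1"))]
diverged = [Syscall("read", (3, 16)), Syscall("write", (1, b"x=2"))]

match_result = check_output(recorded, same)          # None: outputs agree
diverge_result = check_output(recorded, diverged)    # 1: the write differs
```

A non-None result corresponds to a failed divergence check, which is the signal for a rollback.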

Benign Data Races

• Not all races cause divergence checks to fail
• A data race is inconsequential if system output matches

[Figure: after a multi-threaded fork from A (A’), one thread tests x != 0 while another sets x = 1, in both the recorded and the replayed process. Both processes issue the same SysWrite(x), so the check succeeds.]

Divergence due to Data Races

[Figure: after a multi-threaded fork from A (A’), the racing writes x=1 and x=2 execute in different orders in the recorded and the replayed process. SysWrite(x) therefore produces different output and the check fails.]

1) Need to rollback to the beginning
2) Need to buffer system output till the end

Divergence Check #2 – Program State

2) Program state check

• Compare register and memory state at semi-regular intervals (epochs)
• Construct a safe intermediate point
  – To release buffered output
  – To rollback to in case of mis-speculation

[Figure: recorded threads T1, T2 and replayed threads T1’, T2’ start from A and A’. At checkpoint B the check B’ == B succeeds and buffered output is released. In the next epoch the racing writes x=1 and x=2 execute in different orders, so SysWrite(x) fails in the replayed process.]
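A program state check at an epoch boundary can be sketched as comparing digests of register and memory contents. This is an illustrative sketch only: `state_digest` and `epoch_check` are hypothetical names, registers are modelled as a dict of non-negative 64-bit integers, memory as a dict from page number to page bytes, and SHA-256 is an arbitrary choice of hash for the example.

```python
import hashlib

def state_digest(registers, memory):
    """Hash register values and memory pages into a single digest."""
    h = hashlib.sha256()
    for name in sorted(registers):
        h.update(name.encode())
        h.update(registers[name].to_bytes(8, "little"))
    for page_no in sorted(memory):
        h.update(page_no.to_bytes(8, "little"))
        h.update(memory[page_no])
    return h.hexdigest()

def epoch_check(recorded, replayed):
    """True if both processes reached the same register and memory state."""
    return state_digest(*recorded) == state_digest(*replayed)

regs = {"pc": 0x400, "sp": 0x7FF0}
recorded_state = (regs, {0: b"\x01" * 4096})
matching_state = (dict(regs), {0: b"\x01" * 4096})
diverged_state = (dict(regs), {0: b"\x02" * 4096})

commit_ok = epoch_check(recorded_state, matching_state)    # states agree
must_rollback = not epoch_check(recorded_state, diverged_state)
```

When the check passes, the epoch boundary becomes a safe point: buffered output can be released and later rollbacks need go no further back than this checkpoint.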

Recovery from Mis-speculation

Rollback

• Rollback both recorded and replayed processes to the previous checkpoint

Re-execute

• Optimistically re-run the failed epoch
• On repeated failure, switch to uniprocessor execution model
  – Record and replay only one thread at a time
  – Parallel execution resumes after the failed interval

[Figure: recorded threads T1, T2 and replayed threads T1’, T2’ run from checkpoint B; the racing writes x=1 and x=2 execute in different orders, so the check B’ == B fails.]
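The recovery policy above (roll back, optimistically retry, then fall back to one thread at a time) can be sketched as a small loop. Everything here is a simplified model: the function names, the `max_retries` parameter, and the toy epoch in which two threads race on `x` are assumptions for illustration, and "rollback" is modelled by restarting each attempt from an unmodified checkpoint.

```python
WRITES = {"T1": 1, "T2": 2}  # value each thread stores into x

def apply_writes(checkpoint, order):
    """Run the racing writes in the given thread order, starting from the checkpoint."""
    state = dict(checkpoint)
    for thread in order:
        state["x"] = WRITES[thread]
    return state

def make_racy_epoch(replay_orders):
    """Build an epoch whose replay uses a different thread order on each attempt."""
    orders = iter(replay_orders)

    def run_parallel(checkpoint):
        recorded = apply_writes(checkpoint, ("T1", "T2"))
        replayed = apply_writes(checkpoint, next(orders))
        return recorded, replayed

    return run_parallel

def run_serial(checkpoint):
    """Uniprocessor model: one thread at a time, so the order is fixed."""
    return apply_writes(checkpoint, ("T1", "T2"))

def run_epoch_with_recovery(checkpoint, run_parallel, run_serial, max_retries=1):
    """Run one epoch; on divergence roll back and retry, then fall back to serial."""
    attempts = 0
    while attempts <= max_retries:
        recorded, replayed = run_parallel(checkpoint)
        if recorded == replayed:
            return recorded, attempts, "parallel"
        attempts += 1  # rollback: the next attempt restarts from `checkpoint`
    return run_serial(checkpoint), attempts, "uniprocessor"

CHECKPOINT = {"x": 0}
ok_on_retry = run_epoch_with_recovery(
    CHECKPOINT, make_racy_epoch([("T2", "T1"), ("T1", "T2")]), run_serial)
fallback = run_epoch_with_recovery(
    CHECKPOINT, make_racy_epoch([("T2", "T1")] * 2), run_serial)
```

In the first call the retry happens to match and the epoch commits in parallel mode; in the second the divergence repeats, so the epoch is completed under the uniprocessor model.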


Speculative Execution

Speculator [Nightingale et al. SOSP’05]

• Buffer output during speculation
• Block execution if speculative execution is not feasible
• Release buffered output on commit
• Undo speculative changes and squash buffered output on mis-speculation
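The buffer, commit, and squash behaviour can be illustrated with a small class. This is a sketch of the idea only, not Speculator's kernel implementation: `SpeculativeOutput` is a hypothetical name and a Python list stands in for the externally visible output.

```python
class SpeculativeOutput:
    """Holds output produced during speculation until the epoch commits."""

    def __init__(self, sink):
        self._sink = sink    # list standing in for the external world
        self._pending = []   # output that is not yet externally visible

    def write(self, data):
        self._pending.append(data)  # buffered during speculation

    def commit(self):
        self._sink.extend(self._pending)  # release buffered output
        self._pending.clear()

    def squash(self):
        self._pending.clear()  # mis-speculation: discard buffered output

visible = []
out = SpeculativeOutput(visible)
out.write("epoch 1 output")
out.commit()                 # epoch 1 checked out, output released
out.write("epoch 2 output")
out.squash()                 # epoch 2 diverged, output never escapes
```

After this sequence only the output of the committed epoch is externally visible.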


Roadmap

• Motivation/Overview
• Respec Design
• Evaluation
  1. Performance results
  2. Breakdown of performance overhead
  3. Rollback frequency and overhead
• Conclusion

Evaluation Setup

Test Environment

• 2 GHz 8 core Xeon processor with 3 GB of RAM
• Run 1~4 worker threads (excluding control threads)
• Collect the average of 10 trials (except pbzip2 and aget)

Benchmarks

• PARSEC suite – blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
• SPLASH-2 suite – ocean, raytrace, volrend, water-nsq, fft, and radix
• Real applications – pbzip2, pfscan, aget, and Apache

Record and Replay Performance

[Figure: per-benchmark chart (y-axis ticks from 0 to 2) for 1 to 4 worker threads: blackscholes, bodytrack, fluidanimate, swaptions, streamcluster, ocean, raytrace, volrend, water-nsq, fft, radix, pfscan, pbzip2, aget, Apache.]

• 18% overhead for 2 threads, 55% for 4 threads
• Real applications (including Apache) showed <50% for 4 threads

1) Redundant Execution Overhead (25%)

[Figure: per-benchmark chart for 1 to 4 worker threads, highlighting the redundant execution overhead (25%).]

• Cost of running two executions (lower bound of online replay)
• Mainly due to sharing limited resources: memory system
• Contributes 25% of total cost for 4 threads

2) Epoch overhead (17%)

[Figure: per-benchmark chart for 1 to 4 worker threads, showing redundant execution overhead (25%) and highlighting epoch overhead (17%).]

• Due to checkpoint cost
• Due to artificial epoch barrier cost
• Contributes 17% of total cost for 4 threads

3) Memory Comparison Overhead (16%)

[Figure: per-benchmark chart for 1 to 4 worker threads, showing redundant execution overhead (25%) and epoch overhead (17%), and highlighting memory comparison overhead (16%).]

• Optimization 1: compare dirty pages only
• Optimization 2: parallelize comparison
• Contributes 16% of total cost for 4 threads
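The two optimizations can be sketched together: restrict the comparison to pages dirtied during the epoch, and spread the per-page comparisons over a pool of workers. This is an illustrative model, not the kernel code: `diverged_pages` and its parameters are hypothetical, memory is a dict from page number to bytes, and Python threads only demonstrate the structure rather than real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def diverged_pages(recorded, replayed, dirty_rec, dirty_rep, workers=4):
    """Compare only pages dirtied by either process since the last checkpoint.

    Both processes started the epoch from identical memory, so a page that
    neither of them wrote cannot differ and is skipped.
    """
    candidates = sorted(dirty_rec | dirty_rep)

    def differs(page_no):
        return recorded.get(page_no) != replayed.get(page_no)

    # Per-page comparisons are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(differs, candidates))
    return [page for page, bad in zip(candidates, flags) if bad]

recorded_mem = {0: b"a" * 4096, 1: b"b" * 4096, 2: b"c" * 4096}
replayed_mem = {0: b"a" * 4096, 1: b"B" * 4096, 2: b"c" * 4096}

mismatches = diverged_pages(recorded_mem, replayed_mem,
                            dirty_rec={1}, dirty_rep={1, 2})
```

Page 0 is never examined because neither process dirtied it; page 2 is examined and matches; page 1 is reported as divergent.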

4) Logging Overhead (42%)

[Figure: per-benchmark chart for 1 to 4 worker threads, showing redundant execution overhead (25%), epoch overhead (17%), and memory comparison overhead (16%), and highlighting logging and other overhead (42%).]

• Overhead of logging synchronization operations and system calls
• Main cost for applications with fine-grained synchronization
• Contributes 42% of total cost for 4 threads
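As a worked example, the four components can be turned into percentage points of slowdown. This assumes the four figures are shares of the total four-thread overhead (they sum to 100%) and uses the 55% four-thread overhead reported in the performance results; the variable names are illustrative.

```python
total_overhead = 55.0  # percent slowdown at 4 threads

# Share of the total overhead attributed to each component, in percent.
shares = {
    "redundant execution": 25,
    "epoch": 17,
    "memory comparison": 16,
    "logging and other": 42,
}

# Convert each share into percentage points of slowdown.
points = {name: total_overhead * share / 100 for name, share in shares.items()}

for name, value in points.items():
    print(f"{name}: {value:.2f} points")
```

Under this reading, redundant execution alone accounts for roughly 14 of the 55 points, and logging for roughly 23.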

Rollback Frequency and Overhead

App.               Threads  Rollback Frequency  Overhead  Avg. Overhead
Pbzip2 (100 runs)  4        84% none            41%       45%
                            15% once            66%
                            1% twice            105%
Aget (50 runs)     4        80% none            6%        6%
                            18% once            6%
                            2% twice            6%

• Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
• Pbzip2: Rollbacks contribute <10% of total overhead
• Aget: Rollback overhead is negligible
• Frequent checkpoints => short epochs => small amount of work to be re-done
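The average overhead in the table is consistent with a frequency-weighted mean of the per-case overheads, and the same arithmetic gives the share attributable to rollbacks. The sketch below assumes that reading; `average_overhead` is a hypothetical helper.

```python
def average_overhead(cases):
    """cases: list of (fraction_of_runs, overhead_percent) pairs."""
    return sum(fraction * overhead for fraction, overhead in cases)

# Pbzip2, 4 threads: no rollback, one rollback, two rollbacks.
pbzip2 = [(0.84, 41), (0.15, 66), (0.01, 105)]
# Aget, 4 threads.
aget = [(0.80, 6), (0.18, 6), (0.02, 6)]

pbzip2_avg = average_overhead(pbzip2)   # about 45.4
aget_avg = average_overhead(aget)       # about 6

# Portion of Pbzip2's average overhead that exists only because of rollbacks.
pbzip2_rollback_share = (pbzip2_avg - 41) / pbzip2_avg
```

For Pbzip2 this gives an average of about 45% with just under a tenth of it due to rollbacks, in line with the "<10%" bullet; for Aget all three cases cost the same, so rollbacks add nothing measurable.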

Conclusion

Goal: Deterministic replay for multithreaded programs

• Software-only: no custom hardware
• Online: record and replay concurrently

Contributions to replay

• Speculation: speculate race-free, and rollback/retry if needed
• External Determinism: match system output and program states

Results

• Performance overhead when recording and replaying concurrently
  • 2 threads: 18%
  • 4 threads: 55%

Thank you


Benign Data Races

Benign data races could cause frequent rollbacks

• Performance (NOT correctness) issue
• The latest Java and C++ memory models prohibit benign races => There are only harmful races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
• Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords
• Could automatically detect and instrument

Implementation

Modify Linux 2.6.27 kernel

• Deterministic replay
  • Multithreaded fork
  • Record/replay program input (e.g. system calls, signals, …)
  • Compare program state (memory and register contents)
• Speculator [Nightingale et al. SOSP’05]
  • Checkpoint and rollback
  • Buffer system output or propagate speculative states

Modify glibc 2.5.1

• Support recording/replaying low-level synchronization operations
  • e.g. locks, unlocks, futex waits, futex wakes

Handling System Calls

Replayed process

1) Emulate most system calls
• Feed logged return value and data copied into the process

2) Re-execute some system calls
• Create or delete threads: clone, exit, …
• Modify address space: mmap2, mprotect, …
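The split between emulated and re-executed system calls can be sketched as a simple dispatch. This is a toy model: `replay_syscall`, `REEXECUTE`, and the `execute` callback are hypothetical, and the set contains only the four calls named above, not the full list used by Respec.

```python
# Calls that must really run during replay because they create or delete
# threads or modify the address space (only the examples named on the slide).
REEXECUTE = {"clone", "exit", "mmap2", "mprotect"}

def replay_syscall(name, logged_result, execute):
    """Re-execute thread and address-space calls; emulate everything else."""
    if name in REEXECUTE:
        return execute(name)
    # Emulation: hand back the logged result without touching the kernel.
    return logged_result

executed = []

def fake_execute(name):
    """Stand-in for actually issuing the system call."""
    executed.append(name)
    return 0

read_result = replay_syscall("read", 42, fake_execute)    # emulated from the log
mmap_result = replay_syscall("mmap2", 0, fake_execute)    # really executed
```

Only `mmap2` reaches `fake_execute`; the `read` is satisfied entirely from the log.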

Problem

• Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
• Process can NOT transition from replaying to live execution

Solution

• Recreate the OS state by re-executing native/virtualized system calls: ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]

Multi-threaded Fork (Checkpoint)

Copy-on-write fork

• Linux’s fork supports forking only a single thread
• Need a new copy-on-write primitive for checkpointing multiple threads
• Should checkpoint a thread at a safe point
  • kernel entry/exit (system call)

Multi-threaded fork

1) The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution
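The two steps can be mimicked in user space with a barrier and an event. This sketch only illustrates the protocol: `multithreaded_checkpoint` is a hypothetical function, a deep copy of a dict stands in for the copy-on-write checkpoint, and the point where each worker reaches the barrier plays the role of a safe point.

```python
import copy
import threading

def multithreaded_checkpoint(state, n_workers):
    """Stop all workers at a safe point, snapshot shared state, then resume them."""
    arrive = threading.Barrier(n_workers + 1)  # workers plus the initiating thread
    resume = threading.Event()

    def worker(i):
        state[i] = state.get(i, 0) + 1  # work done before the safe point
        arrive.wait()                   # safe point reached
        resume.wait()                   # paused while the checkpoint is taken
        state[i] += 100                 # work done after the checkpoint

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()

    arrive.wait()                       # step 1: wait until every thread has arrived
    snapshot = copy.deepcopy(state)     # step 2: checkpoint while all are stopped
    resume.set()                        # let the other threads continue
    for t in threads:
        t.join()
    return snapshot

shared_state = {}
checkpoint = multithreaded_checkpoint(shared_state, 3)
```

The snapshot contains only the pre-checkpoint work of every thread, while `shared_state` afterwards also reflects the work done once the threads resumed.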

Semi-regular checkpoints

• Adaptive epoch length
  • To bound the amount of work that must be redone on rollback
• Output-triggered commit
  • To provide acceptable latency for interactive tasks

Benefits of Program State Check

1) Allow Respec to commit epochs and release system output
• Buffer output during speculation
• Safe to release output on commit after matching program state

2) Reduce the amount of execution that must be re-done when a check fails

3) Allow broader uses of replay system
• Tolerating non-fail-stop faults (e.g. transient hardware fault)
  • Need to detect latent faults
• Parallelizing security and reliability checks

Offline Replay with Respec

Respec Log

• Kernel’s system call + User-level synchronizations
• MD5 checksum of address space and register state

Problem: Not all races are logged

• Offline replay is NOT guaranteed to succeed
• Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed

Solution

• Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]

Non-Deterministic Program Input

• e.g. I/O, DMA, interrupts, signals, RDTSC, context-switch, page fault
• Asynchronous interrupts (caused by external sources)
  • e.g. I/O, timer, disk read completion
• Synchronous interrupts (= traps)
  • e.g. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss
• x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode)
  • e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)

Rollback Frequency and Overhead (Pbzip2)

Threads  Rollback Frequency                   Original Time (sec)  Type          Respec Time (sec)  Slowdown
1        0%                                   4.59                 overall       4.83               5%
2        13% once                             2.35                 w/o rollback  2.70               15%
                                                                   w/ rollback   2.97               26%
                                                                   overall       2.73               16%
3        9% once, 2% twice                    1.64                 w/o rollback  2.00               22%
                                                                   w/ rollback   2.29               40%
                                                                   overall       2.03               24%
4        84% no rollback, 15% once, 1% twice  1.33                 w/o rollback  1.88               41%
                                                                   w/ rollback   2.24               68%
                                                                   overall       1.93               45%

• Out of 100 runs, 13-16% of executions invoke one or more rollbacks
• Rollbacks contribute 8% of Respec's total overhead

Rollback Frequency and Overhead (Aget)

Slowdown and Respec Time are listed per thread count in the order w/o rollback, w/ rollback, overall.

Threads  Rollback Frequency  Original Time (sec)  Respec Time (sec)  Slowdown
1        10% once, 2% twice  2.05                 2.19, 2.21, 2.19   7%, 8%, 7%
2        20% once, 2% twice  1.93                 2.17               13%, 13%, 13%
3        24% once            1.94                 2.08, 2.09, 2.08   7%, 8%, 7%
4        18% once, 2% twice  1.96                 2.07, 2.08         6%, 6%, 6%

• Out of 50 runs, 14-24% of executions invoke one or more rollbacks
• Performance impact is negligible (due to very frequent checkpoints)
