Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn

University of Michigan, Ann Arbor

Deterministic Replay

• Record and reproduce non-deterministic events

1) Offline

Uses: replay repeatedly after the original run
• Debugging
• Forensics

2) Online

Uses: record and replay concurrently
• Fault tolerance
• Decoupled runtime checks

We focus on online replay for multi-processors


Online Deterministic Replay Uses

Fault Tolerance

[Figure: a server receives requests, logs them, and sends responses; a replica replays the log to keep the same state and takes over when the server faults.]

Decoupled Runtime Checks

[Figure: the application runs A, B, C on P1; replay plus checks (A + Check, B + Check, C + Check) run on P2, P3, P4.]

• Need to record and replay concurrently
• Both recording and replaying should be efficient

Past Solutions for Deterministic Replay

Uniprocessor Replay

• Program input (e.g. system calls, signals, etc.)
• Thread scheduling

Multiprocessor Replay: + Shared memory dependencies

Instrument every memory operation → 10-100x
• PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06]

Page protection → 2-9x
• SMP-ReVirt [Dunlap, VEE’08]

Offline search → Slow replay
• ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09]

Hardware support → Custom HW
• FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08]

Overview of Our Approach

Goal: Efficient online software-only multiprocessor replay

[Figure: the recorded process (T1, T2) starts at Checkpoint A; a multi-threaded fork creates the replayed process (T1’, T2’) at A’. The recorded Lock(l)/Unlock(l) order is replayed as Lock(l’)/Unlock(l’) while speculating that the program is race free. At Checkpoint B the replay is checked: B’ == B?]

Key Idea: Speculation + Check

1) Speculate data race free
2) Detect mis-speculation using a cheap check
3) Rollback and retry on mis-speculation

Roadmap

• Motivation/Overview
• Respec Design
  1. Speculate data race free
  2. Detect mis-speculation
  3. Rollback and retry on mis-speculation
• Evaluation
• Conclusion

Deterministic Replay of Data-race-free Programs

Observation

• Reproducing program input and the happens-before order of synchronization operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99]

1) Program input (e.g. system calls, signals, etc.)
• Record: Log system call effects + total order
• Replay: Emulate system call + total order

2) Synchronization operations
• Record and replay happens-before order
• Instrument common (not all) synchronization primitives in glibc
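The second point can be illustrated with a minimal Python sketch. It assumes two hypothetical wrappers, `RecordingLock` and `ReplayingLock`, which are not Respec's glibc instrumentation: the recorder logs the order in which threads acquire a lock, and the replayer makes each thread wait until the log says it is its turn.

```python
import threading

class RecordingLock:
    """Hypothetical wrapper: logs the order in which threads acquire the lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.order = []              # logged acquisition order (thread names)

    def acquire(self, name):
        self._lock.acquire()
        self.order.append(name)      # appended while holding the lock

    def release(self):
        self._lock.release()

class ReplayingLock:
    """Hypothetical wrapper: grants the lock in the logged order."""
    def __init__(self, order):
        self._order = list(order)
        self._next = 0
        self._held = False
        self._cond = threading.Condition()

    def _my_turn(self, name):
        return (not self._held
                and self._next < len(self._order)
                and self._order[self._next] == name)

    def acquire(self, name):
        with self._cond:
            # Block until the lock is free and the log names this thread next.
            self._cond.wait_for(lambda: self._my_turn(name))
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._next += 1
            self._cond.notify_all()

def run(lock, names):
    """Run one thread per name; each appends its name while holding the lock."""
    shared = []
    def worker(name):
        lock.acquire(name)
        shared.append(name)
        lock.release()
    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

names = ["T1", "T2", "T3"]
recording_lock = RecordingLock()
recorded = run(recording_lock, names)
replayed = run(ReplayingLock(recording_lock.order), names)
```

Whatever order the recorded run happens to take, the replayed run reproduces it, because the only shared access is protected by the lock whose order was logged.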

What if a program is NOT race free?

Problem
• Need to detect mis-speculation
• Data race detector is too heavy-weight

Insight: External Determinism is sufficient
• Not necessary to replay data races
• Ensure that the replayed process produces the same visible effects as the recorded process to an external observer
• Visible effects = System output + Final program state

Solution: Divergence checks
• Detect mis-speculation when the replay is not externally deterministic

Divergence Check #1 – System Output

1) System Output Check

• For every system call, compare system call arguments
• Ensure that the replay produces the same output as the recorded process

[Figure: the recorded process (T1, T2) starts at A; a multi-threaded fork starts the replayed process (T1’, T2’) at A’. Both follow the same Lock/Unlock order, and SysRead X is replayed as SysRead X’. When the recorded process issues SysWrite O, the replayed SysWrite O’ is checked: O’ == O?]
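A minimal sketch of the system output check, in Python for brevity. The log layout and the names `record_syscall`, `check_syscall`, and `DivergenceError` are assumptions made for illustration, not Respec's kernel interface: the recorded process logs each call and its arguments, and the replayed process compares its own call against the next log entry.

```python
from collections import deque

class DivergenceError(Exception):
    """Raised when the replayed process issues a different system call."""

def record_syscall(log, name, args):
    """Recorded process: append the call and its arguments to the log."""
    log.append((name, tuple(args)))

def check_syscall(log, name, args):
    """Replayed process: compare this call against the next logged call."""
    if not log:
        raise DivergenceError(f"unexpected extra call {name}")
    expected = log.popleft()
    actual = (name, tuple(args))
    if expected != actual:
        raise DivergenceError(f"expected {expected}, got {actual}")

log = deque()
record_syscall(log, "write", (1, b"x=1\n"))

# A replay that writes the same bytes passes the check.
matching = deque(log)
check_syscall(matching, "write", (1, b"x=1\n"))

# A replay whose racy value differs is flagged as a mis-speculation.
diverged = False
try:
    check_syscall(deque(log), "write", (1, b"x=2\n"))
except DivergenceError:
    diverged = True
```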

Benign Data Races

• Not all races cause divergence checks to fail
• A data race is inconsequential if system output matches

[Figure: T1 polls x != 0 while T2 sets x = 1. The recorded process polls more times than the replayed process, but both issue the same SysWrite(x): Success.]

Divergence due to Data Races

[Figure: after the multi-threaded fork, the racing writes x=1 and x=2 land in different orders in the recorded and replayed processes, so SysWrite(x) differs: Fail.]

1) Need to rollback to the beginning
2) Need to buffer system output till the end

Divergence Check #2 – Program State

2) Program state check

• Compare register and memory state at semi-regular intervals (epochs)
• Construct a safe intermediate point
– To release buffered output
– To rollback to in case of mis-speculation

[Figure: both processes run from Start A / A’ to Checkpoint B, where the check B’ == B succeeds and buffered output is released. In the next epoch the racing writes x=1 and x=2 make SysWrite(x) differ: Fail.]

Recovery from Mis-speculation

Rollback

• Rollback both recorded and replayed processes to the previous checkpoint

Re-execute

• Optimistically re-run the failed epoch
• On repeated failure, switch to uniprocessor execution model
– Record and replay only one thread at a time
– Parallel execution resumes after the failed interval

[Figure: starting from Checkpoint B, the racing writes x=1 and x=2 land in different orders in the two processes; Check B’ == B? Fail.]
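The recovery policy above can be sketched as a retry loop. This is a simplified Python model, not Respec's implementation: `run_epoch`, the callbacks, and the retry limit `MAX_RETRIES` are hypothetical (the slides only say "on repeated failure"), and a deep copy of a dictionary stands in for a checkpoint.

```python
import copy

MAX_RETRIES = 2   # assumed limit; the slides do not give a number

def run_epoch(checkpoint, record_fn, replay_fn, serial_fn):
    """Run one epoch speculatively; return (committed state, how it ran)."""
    for attempt in range(MAX_RETRIES):
        # Both sides start from the same checkpoint on every attempt,
        # so a failed attempt is rolled back simply by discarding its state.
        recorded = record_fn(copy.deepcopy(checkpoint), attempt)
        replayed = replay_fn(copy.deepcopy(checkpoint), attempt)
        if recorded == replayed:                 # divergence check: B' == B ?
            return recorded, f"parallel (attempt {attempt + 1})"
    # Repeated failure: run one thread at a time so both sides agree.
    return serial_fn(copy.deepcopy(checkpoint)), "uniprocessor fallback"

def record_run(state, attempt):
    state["x"] = 1
    state["x"] = 2        # the write of 2 lands last in the recorded process
    return state

def diverging_replay(state, attempt):
    state["x"] = 2
    state["x"] = 1        # the write of 1 lands last in the replayed process
    return state

def lucky_replay(state, attempt):
    # Diverges on the first attempt, matches the recording on the retry.
    if attempt == 0:
        return diverging_replay(state, attempt)
    return record_run(state, attempt)

def serial_run(state):
    state["x"] = 1
    state["x"] = 2
    return state

start = {"x": 0}
state_a, how_a = run_epoch(start, record_run, lucky_replay, serial_run)
state_b, how_b = run_epoch(start, record_run, diverging_replay, serial_run)
```

In the first call the retry succeeds and the epoch commits in parallel mode; in the second, every speculative attempt diverges and the epoch falls back to serial execution.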

Speculative Execution

Speculator [Nightingale et al. SOSP’05]

• Buffer output during speculation
• Block execution if speculative execution is not feasible
• Release buffered output on commit
• Undo speculative changes and squash buffered output on mis-speculation
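The buffer, commit, and squash behavior can be sketched in a few lines of Python. The class `SpeculativeOutput` is a hypothetical illustration of the idea, not Speculator's kernel API: output written during speculation stays invisible until the epoch commits.

```python
class SpeculativeOutput:
    """Hypothetical output buffer: nothing is visible until commit."""
    def __init__(self, sink):
        self._sink = sink          # list standing in for the external world
        self._pending = []

    def write(self, data):
        self._pending.append(data)         # buffered during speculation

    def commit(self):
        self._sink.extend(self._pending)   # release buffered output
        self._pending.clear()

    def squash(self):
        self._pending.clear()              # mis-speculation: drop the output

visible = []
out = SpeculativeOutput(visible)

out.write("x=1")
out.squash()                    # the epoch failed its check

out.write("x=2")
before_commit = list(visible)   # still nothing externally visible
out.commit()                    # the epoch passed its check
```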

Roadmap

• Motivation/Overview
• Respec Design
• Evaluation
  1. Performance results
  2. Breakdown of performance overhead
  3. Rollback frequency and overhead
• Conclusion

Evaluation Setup

Test Environment

• 2 GHz 8-core Xeon processor with 3 GB of RAM
• Run 1~4 worker threads (excluding control threads)
• Collect the average of 10 trials (except pbzip2 and aget)

Benchmarks

• PARSEC suite
– blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
• SPLASH-2 suite
– ocean, raytrace, volrend, water-nsq, fft, and radix
• Real applications
– pbzip2, pfscan, aget, and Apache

Record and Replay Performance

[Figure: bar chart, y-axis 0 to 2, with one bar per worker-thread count (1 to 4; fluidanimate, ocean, fft, and radix at 1, 2, and 4 only) for blackscholes, bodytrack, fluidanimate, swaptions, streamcluster, ocean, raytrace, volrend, water-nsq, fft, radix, pfscan, pbzip2, aget, and Apache.]

• Overhead: 18% for 2 threads, 55% for 4 threads
• Real applications (including Apache) showed <50% for 4 threads

1) Redundant Execution Overhead (25%)

[Figure: bar chart, y-axis 0 to 2, for 1 to 4 worker threads of each benchmark; legend: Redundant execution overhead (25%).]

• Cost of running two executions (lower bound of online replay)
• Mainly due to sharing limited resources: memory system
• Contributes 25% of total cost for 4 threads

2) Epoch overhead (17%)

[Figure: bar chart, y-axis 0 to 2, for 1 to 4 worker threads of each benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%).]

• Due to checkpoint cost
• Due to artificial epoch barrier cost
• Contributes 17% of total cost for 4 threads

3) Memory Comparison Overhead (16%)

[Figure: bar chart, y-axis 0 to 2, for 1 to 4 worker threads of each benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%), Memory comparison overhead (16%).]

• Optimization 1: compare dirty pages only
• Optimization 2: parallelize comparison
• Contributes 16% of total cost for 4 threads
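Both optimizations can be sketched together. This Python model assumes a 4096-byte page, a precomputed set of dirty page numbers, and a hypothetical helper `pages_match`; it shows the structure of the comparison rather than a real speedup, since Python threads do not run byte comparisons in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096   # assumed page size

def pages_match(recorded, replayed, dirty_pages, workers=4):
    """Compare only the dirty pages of two address spaces, page by page."""
    def same(page):
        start = page * PAGE_SIZE
        end = start + PAGE_SIZE
        return recorded[start:end] == replayed[start:end]
    # Each dirty page is an independent unit of work for the pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(same, sorted(dirty_pages)))

recorded = bytearray(8 * PAGE_SIZE)
replayed = bytearray(8 * PAGE_SIZE)

# Both processes wrote the same value to page 3 during the epoch.
recorded[3 * PAGE_SIZE] = 1
replayed[3 * PAGE_SIZE] = 1
ok = pages_match(recorded, replayed, {3})

# The replayed process also wrote to page 5: the states have diverged.
replayed[5 * PAGE_SIZE + 7] = 2
bad = pages_match(recorded, replayed, {3, 5})
```

The dirty set here is assumed to contain every page written by either process during the epoch; clean pages are identical by construction and are skipped.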

4) Logging Overhead (42%)

[Figure: bar chart, y-axis 0 to 2, for 1 to 4 worker threads of each benchmark; legend: Redundant execution overhead (25%), Epoch overhead (17%), Memory comparison overhead (16%), Logging and other overhead (42%).]

• Overhead of logging synchronization operations and system calls
• Main cost for applications with fine-grained synchronization
• Contributes 42% of total cost for 4 threads

Rollback Frequency and Overhead

App.               Threads  Rollback Frequency  Overhead  Avg. Overhead
Pbzip2 (100 runs)  4        84% none            41%       45%
                            15% once            66%
                            1% twice            105%
Aget (50 runs)     4        80% none            6%        6%
                            18% once            6%
                            2% twice            6%

• Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
• Pbzip2: Rollbacks contribute <10% of total overhead
• Aget: Rollback overhead is negligible
• Frequent checkpoints => short epochs => small amount of work to be re-done
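The average overhead and the rollback share can be checked from the table's own numbers. The short Python calculation below weights each case by its frequency; the helper `weighted_overhead` is introduced here only for illustration.

```python
def weighted_overhead(cases):
    """cases: list of (fraction of runs, overhead in percent)."""
    return sum(fraction * overhead for fraction, overhead in cases)

pbzip2 = [(0.84, 41), (0.15, 66), (0.01, 105)]
aget = [(0.80, 6), (0.18, 6), (0.02, 6)]

pbzip2_avg = weighted_overhead(pbzip2)   # about 45.4, reported as 45%
aget_avg = weighted_overhead(aget)       # about 6, reported as 6%

# Share of Pbzip2's average overhead that comes from rollbacks:
# the average minus the overhead of a run without any rollback.
rollback_share = (pbzip2_avg - 41) / pbzip2_avg   # about 0.097, i.e. <10%
```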

Conclusion

Goal: Deterministic replay for multithreaded programs
• Software-only: no custom hardware
• Online: record and replay concurrently

Contributions to replay
• Speculation: speculate race-free, and rollback/retry if needed
• External Determinism: match system output and program states

Results
• Performance overhead when recording and replaying concurrently
• 2 threads: 18%
• 4 threads: 55%

Thank you


Benign Data Races

Benign data races could cause frequent rollbacks

• Performance (NOT correctness) issue
• The latest Java and C++ memory models prohibit benign races => There are only harmful races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
• Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords
• Could automatically detect and instrument

Implementation

Modify Linux 2.6.27 kernel

• Deterministic replay
– Multithreaded fork
– Record/replay program input (e.g. system calls, signals, …)
– Compare program state (memory and register contents)
• Speculator [Nightingale et al. SOSP’05]
– Checkpoint and rollback
– Buffer system output or propagate speculative states

Modify glibc 2.5.1

• Support recording/replaying low-level synchronization operations
– e.g. locks, unlocks, futex waits, futex wakes

Handling System Calls

Replayed process

1) Emulate most system calls
• Feed logged return value and data copied into the process

2) Re-execute some system calls
• Create or delete threads: clone, exit, …
• Modify address space: mmap2, mprotect, …

Problem

• Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
• Process can NOT transition from replaying to live execution

Solution

• Recreate the OS state by re-executing native/virtualized system calls
• ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]

Multi-threaded Fork (Checkpoint)

Copy-on-write fork

• Linux’s fork supports forking only a single thread
• Need a new copy-on-write primitive for checkpointing multiple threads
• Should checkpoint a thread at a safe point
– kernel entry/exit (system call)

Multi-threaded fork

1) The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution

Semi-regular checkpoints

• Adaptive epoch length
– To bound the amount of work that must be redone on rollback
• Output triggered commit
– To provide acceptable latency for interactive tasks
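The two-step barrier protocol of the multi-threaded fork can be modeled in user space. In this Python sketch, which is an illustration and not the kernel implementation, `multithreaded_checkpoint` is a hypothetical function, a deep copy stands in for the copy-on-write fork, and a `threading.Barrier` wait stands in for reaching a safe point.

```python
import copy
import threading

def multithreaded_checkpoint(shared_state, n_workers):
    """Checkpoint shared_state once every worker has reached a safe point."""
    arrived = threading.Barrier(n_workers + 1)   # workers + initiating thread
    resume = threading.Barrier(n_workers + 1)

    def worker(i):
        shared_state[f"t{i}"] = i          # work before the safe point
        arrived.wait()                     # safe point reached
        resume.wait()                      # held until the checkpoint exists
        shared_state[f"t{i}"] = i + 100    # work after the checkpoint

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()

    arrived.wait()                             # step 1: wait for all threads
    checkpoint = copy.deepcopy(shared_state)   # step 2: create the checkpoint
    resume.wait()                              # then let the others continue

    for t in threads:
        t.join()
    return checkpoint

state = {}
checkpoint = multithreaded_checkpoint(state, 3)
```

The checkpoint captures only the pre-barrier writes, because no worker can pass the second barrier until the initiating thread has finished copying.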

Benefits of Program State Check

1) Allow Respec to commit epochs and release system output
• Buffer output during speculation
• Safe to release output on commit after matching program state

2) Reduce the amount of execution that must be re-done when a check fails

3) Allow broader uses of replay system
• Tolerating non-fail-stop faults (e.g. transient hardware fault)
– Need to detect latent faults
• Parallelizing security and reliability checks

Offline Replay with Respec

Respec Log

• Kernel’s system calls + user-level synchronizations
• MD5 checksum of address space and register state

Problem: Not all races are logged

• Offline replay is NOT guaranteed to succeed
• Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed

Solution

• Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]
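A checksum of the kind kept in the log can be sketched with Python's `hashlib`. The function `state_checksum`, the register names, and the serialization (sorted register names, 64-bit little-endian values, then raw memory) are assumptions for illustration; the slides only state that an MD5 checksum of address space and register state is logged.

```python
import hashlib
import struct

def state_checksum(registers, memory):
    """MD5 over register values (in name order) followed by memory contents."""
    digest = hashlib.md5()
    for name in sorted(registers):
        digest.update(name.encode())
        digest.update(struct.pack("<Q", registers[name]))
    digest.update(bytes(memory))
    return digest.hexdigest()

# Hypothetical register names and values.
regs = {"pc": 0x400000, "sp": 0x7FFF0000}

logged = state_checksum(regs, bytearray(64))

# An offline replay that reaches the same state produces the same checksum.
same = state_checksum({"sp": 0x7FFF0000, "pc": 0x400000}, bytearray(64))

# A replay whose memory differs in one byte does not.
memory = bytearray(64)
memory[10] = 1
different = state_checksum(regs, memory)
```

An offline search tool could use such a checksum at each epoch boundary to decide whether a candidate replay matches the recorded execution.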

Non-Deterministic Program Input

• e.g. I/O, DMA, interrupts, signals, RDTSC, context-switch, page fault
• Asynchronous interrupts (caused by external sources)
– e.g. I/O, timer, disk read completion
• Synchronous interrupts (= traps)
– e.g. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss
• x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode)
– e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)

Rollback Frequency and Overhead (Pbzip2)

Threads  Rollback Frequency     Original Time (sec)  Type          Respec Time (sec)  Slowdown
1        0%                     4.59                 overall       4.83               5%
2        13% once               2.35                 w/o rollback  2.70               15%
                                                     w/ rollback   2.97               26%
                                                     overall       2.73               16%
3        9% once, 2% twice      1.64                 w/o rollback  2.00               22%
                                                     w/ rollback   2.29               40%
                                                     overall       2.03               24%
4        84% no rollback,       1.33                 w/o rollback  1.88               41%
         15% once, 1% twice                          w/ rollback   2.24               68%
                                                     overall       1.93               45%

• Out of 100 runs, 13-16% of executions invoke one or more rollbacks
• Rollbacks contribute 8% of Respec's total overhead

Rollback Frequency and Overhead (Aget)

Threads  Rollback Frequency   Original Time (sec)  Type          Respec Time (sec)  Slowdown
1        10% once, 2% twice   2.05                 w/o rollback  2.19               7%
                                                   w/ rollback   2.21               8%
                                                   overall       2.19               7%
2        20% once, 2% twice   1.93                 w/o rollback  2.17               13%
                                                   w/ rollback   2.17               13%
                                                   overall       2.17               13%
3        24% once             1.94                 w/o rollback  2.08               7%
                                                   w/ rollback   2.09               8%
                                                   overall       2.08               7%
4        18% once, 2% twice   1.96                 w/o rollback  2.07               6%
                                                   w/ rollback   2.08               6%
                                                   overall       2.08               6%

• Out of 50 runs, 14-24% of executions invoke one or more rollbacks
• Performance impact is negligible (due to very frequent checkpoints)