Speculative Execution in a Distributed File System
Ed Nightingale
Peter Chen
Jason Flinn
University of Michigan
Best Paper at SOSP 2005
Modified for CS739
Motivation
• Why are distributed file systems slow(er)?
– Sync network messages provide consistency
– Sync disk writes provide safety
• Sacrifice guarantees for speed
• Can a DFS be safe, consistent, and fast?
– Yes! With OS support for speculative execution
Big Idea: Speculator
Slow way (Client, Server): RPC Req, Block!, RPC Resp
Speculator (Client, Server): RPC Req, then
1) Checkpoint
2) Speculate!
3) Correct? (checked against RPC Resp)
Yes: discard ckpt.
No: restore process & re-execute
• Guarantees without blocking I/O!
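The checkpoint, speculate, and verify steps can be sketched in a few lines of user-level Python. This is only an illustration under stated assumptions: the real system checkpoints a process inside the kernel and overlaps the RPC with execution, whereas here `speculative_call`, `record`, and the dictionary-based state are hypothetical names, and the RPC is performed synchronously for simplicity.

```python
import copy

def speculative_call(state, predicted, do_rpc, continue_with):
    """Continue with a predicted RPC result; roll back if the prediction was wrong."""
    checkpoint = copy.deepcopy(state)   # 1) checkpoint
    continue_with(state, predicted)     # 2) speculate on the predicted result
    actual = do_rpc()                   # the real response (synchronous in this sketch)
    if actual == predicted:             # 3) correct?
        return state                    # yes: the checkpoint is simply discarded
    state = checkpoint                  # no: restore the checkpoint ...
    continue_with(state, actual)        # ... and re-execute with the real result
    return state

def record(state, attr):
    """Hypothetical continuation: remember which attribute value was used."""
    state["seen"].append(attr)

ok = speculative_call({"seen": []}, "v1", lambda: "v1", record)
bad = speculative_call({"seen": []}, "v1", lambda: "v2", record)
```

In the mispredicted case the effect of the speculative run is thrown away, so only the result of the re-execution survives.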
Conditions for Success
• Operations are highly predictable
– Conflicts are rare
• Checkpoints are cheaper than network I/O
– 52 µs for a small process (6.3 ms for a 64 MB process)
• Computers have resources to spare
– Need memory and CPU cycles for speculation
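A rough cost model makes these conditions concrete. The formula below is an assumption made for illustration, not a model from the talk: every operation pays for a checkpoint, and a failed speculation additionally pays for a rollback plus the round trip it tried to hide. The checkpoint costs (52 µs, 6.3 ms) and the 30 ms delay come from the slides; the 1% failure rate, the 0.2 ms LAN round trip, and the choice of rollback cost equal to checkpoint cost are hypothetical.

```python
def expected_speculative_cost(checkpoint_s, rollback_s, rtt_s, p_fail):
    """Expected per-operation cost in seconds under the assumed model."""
    return checkpoint_s + p_fail * (rollback_s + rtt_s)

# Small process over a 30 ms link: compare against blocking for one round trip.
small_wan = expected_speculative_cost(52e-6, 52e-6, 30e-3, p_fail=0.01)
speculation_wins_wan = small_wan < 30e-3

# 64 MB process over an assumed 0.2 ms LAN: the checkpoint alone exceeds the round trip.
large_lan = expected_speculative_cost(6.3e-3, 6.3e-3, 0.2e-3, p_fail=0.01)
speculation_wins_lan = large_lan < 0.2e-3
```

Under these assumptions speculation pays off easily on the slow link but not for a large process on a fast one, which is why checkpoint size shows up again among the experiments.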
Assumptions --> Experiments
• Operations are highly predictable
– What happens if not (lots of sharing)?
– Will the cost of checkpointing and rolling back ever harm performance?
• Checkpoints are cheaper than network I/O
– What happens with large memory processes?
• Computers have resources to spare (Clients)
– What happens when running multiple processes?
– Can network become bottleneck?
– Can server become bottleneck?
Outline
• Motivation
• Implementing speculation
• Multi-process speculation
• Using Speculator
• Evaluation
Implementing Speculation
• Implementation
– All within OS
– No changes needed to applications
– Three new interfaces
• Create_speculation()
• Commit_speculation()
• Fail_speculation()
• Goal: Ensure speculative state is
– Never externalized (sent to terminal, disk, network)
– Never directly observed by non-speculative processes
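To show how the three interfaces fit together, here is a minimal user-level sketch, assuming a per-speculation undo log over plain dictionaries. The function names mirror the slide, but the `Speculation` class, `spec_write`, and the dictionary standing in for a kernel object are hypothetical; the real implementation lives in the kernel and also checkpoints the process.

```python
_MISSING = object()  # marks "key did not exist before the speculative write"

class Speculation:
    """Toy stand-in for a speculation record (hypothetical structure)."""
    def __init__(self):
        self.undo_log = []      # (object, key, old value) entries, oldest first
        self.resolved = False

def create_speculation():
    return Speculation()

def spec_write(spec, obj, key, value):
    """Modify obj[key] speculatively, remembering the old value."""
    spec.undo_log.append((obj, key, obj.get(key, _MISSING)))
    obj[key] = value

def commit_speculation(spec):
    spec.undo_log.clear()       # the speculative changes become permanent
    spec.resolved = True

def fail_speculation(spec):
    for obj, key, old in reversed(spec.undo_log):   # undo newest change first
        if old is _MISSING:
            obj.pop(key, None)
        else:
            obj[key] = old
    spec.undo_log.clear()
    spec.resolved = True

inode = {"owner": "alice"}
failed = create_speculation()
spec_write(failed, inode, "owner", "bob")
spec_write(failed, inode, "size", 10)
fail_speculation(failed)
after_failure = dict(inode)     # snapshot taken after the rollback

committed = create_speculation()
spec_write(committed, inode, "owner", "bob")
commit_speculation(committed)
```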
Implementing Speculation
Timeline: 1) System call, 2) Create speculation
• Checkpoint: copy-on-write fork
• Undo log: for each kernel object
• Spec: tracks objects that depend on speculation
Speculation Success
Timeline: 1) System call, 2) Create speculation, 3) Commit speculation
[Diagram: Checkpoint, Undo log, Spec]
Speculation Failure
Timeline: 1) System call, 2) Create speculation, 3) Fail speculation
[Diagram: Checkpoint, Undo log, Spec]
Replay after Failure
• Does code after checkpoint need to be deterministic for correct results?
Ensuring Correctness
• Spec processes often affect external state
– Speculative state should never be visible to user or any external device
– Process should never view speculative state unless it is already speculatively dependent on that state
• Three ways to ensure correct execution
– Block
– Buffer
– Propagate speculations (dependencies)
System Calls
• Modify system call jump table
• Block calls that externalize state
– Allow read-only calls (e.g. getpid)
– Allow calls that modify only task state (e.g. dup2)
• File system calls -- need to dig deeper
– Mark file systems that support Speculator
Jump table examples:
getpid: Call sys_getpid()
reboot: Block until specs resolved
mkdir: Allow only if fs supports Speculator
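The policy on this slide can be sketched as a small dispatch function. This is an assumption-laden illustration rather than the kernel's actual table: `dispatch`, `Action`, and the three sets are hypothetical, and only the calls named on the slides (getpid, dup2, mkdir, stat, reboot) are classified.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block until specs resolved"

READ_ONLY = {"getpid"}          # cannot externalize state
TASK_STATE_ONLY = {"dup2"}      # modifies only the calling task's state
FILE_SYSTEM = {"mkdir", "stat"} # depends on the file system underneath

def dispatch(syscall, is_speculative, fs_supports_speculator=False):
    """Decide what to do with a system call issued by a process."""
    if not is_speculative:
        return Action.ALLOW     # non-speculative processes are unaffected
    if syscall in READ_ONLY or syscall in TASK_STATE_ONLY:
        return Action.ALLOW
    if syscall in FILE_SYSTEM and fs_supports_speculator:
        return Action.ALLOW
    return Action.BLOCK         # default: anything else might externalize state
```

Blocking by default is the conservative choice: a call that is not known to be safe simply waits until the process's speculations resolve.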
Output Commits: Buffer
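Timeline: 1) sys_stat, 2) sys_mkdir, 3) Commit spec 1
[Diagram: each call takes a checkpoint and creates a speculation, Spec (stat) and Spec (mkdir); the outputs “stat worked” and “mkdir worked” are held with their speculations, alongside an undo log]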
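Buffering output until its speculations resolve can be sketched as follows. The `OutputBuffer` class and its methods are hypothetical, and the sketch assumes that “mkdir worked” depends on both the stat and the mkdir speculation because it was printed after both calls.

```python
class OutputBuffer:
    """Holds output until every speculation it depends on has committed."""
    def __init__(self):
        self.pending = []   # (text, unresolved speculation ids), in program order
        self.visible = []   # output that has actually been externalized

    def write(self, text, depends_on):
        self.pending.append((text, set(depends_on)))
        self._flush()

    def commit(self, spec_id):
        for _, deps in self.pending:
            deps.discard(spec_id)
        self._flush()

    def fail(self, spec_id):
        # Output produced under a failed speculation is dropped, never shown.
        self.pending = [(t, d) for t, d in self.pending if spec_id not in d]
        self._flush()

    def _flush(self):
        # Release output in order, stopping at the first entry that is still speculative.
        while self.pending and not self.pending[0][1]:
            self.visible.append(self.pending.pop(0)[0])

buf = OutputBuffer()
buf.write("stat worked", {1})
buf.write("mkdir worked", {1, 2})
before_commit = list(buf.visible)
buf.commit(1)
after_commit_1 = list(buf.visible)
buf.fail(2)
```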
Multi-Process Speculation
• Processes often cooperate
– Example: “make” forks children to compile, link, etc.
– Would block if speculation limited to one task
• Allow kernel objects to have speculative state
– Examples: inodes, signals, pipes, Unix sockets, etc.
– Propagate dependencies among objects
– Objects rolled back to prior states when specs fail
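Dependency propagation between processes and kernel objects can be sketched like this. It is a simplified model under stated assumptions: `KObject`, `spec_write`, `spec_read`, and `fail_speculation` are hypothetical names, a process is represented only by the set of speculation ids it depends on, and rolling back the reading process itself (via its checkpoint) is left out.

```python
class KObject:
    """Toy kernel object (inode, pipe, ...) that can hold speculative state."""
    def __init__(self, value):
        self.value = value
        self.deps = set()   # ids of speculations the current value depends on
        self.undo = []      # (deps of the write, value before the write), oldest first

def spec_write(writer_deps, obj, new_value):
    """A speculative process writes: the object inherits the writer's dependencies."""
    obj.undo.append((frozenset(writer_deps), obj.value))
    obj.value = new_value
    obj.deps |= writer_deps

def spec_read(reader_deps, obj):
    """A process reads: it inherits the object's dependencies."""
    reader_deps |= obj.deps
    return obj.value

def fail_speculation(spec_id, objects):
    """Roll each object back to its state before the first write that depended on spec_id."""
    for obj in objects:
        for i, (deps, old_value) in enumerate(obj.undo):
            if spec_id in deps:
                obj.value = old_value
                del obj.undo[i:]            # conservatively drops later writes too
                obj.deps = set()
                for remaining, _ in obj.undo:
                    obj.deps |= remaining
                break

pipe = KObject("")
p8000, p8001 = {1}, set()       # process 8000 already depends on speculation 1
spec_write(p8000, pipe, "data")
read_back = spec_read(p8001, pipe)
deps_after_read = set(p8001)
fail_speculation(1, [pipe])
```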
Multi-Process Speculation
[Diagram: processes pid 8000 and pid 8001, each with a checkpoint; operations Stat A and Stat B; speculations Spec 1 and Spec 2; inode 3456 with undo-log entries Chown-1 and Write-1]
Multi-Process Speculation
• What we handle:
– DFS objects, RAMFS, Ext3, Pipes & FIFOs
– Unix Sockets, Signals, Fork & Exit
• What we don’t handle (i.e., we block)
– System V IPC
– Multi-process write-shared memory
Example: NFSv3 Linux
Client 1: Modify B, then Write and Commit to Server
Client 2: Open B, which sends Getattr to Server
Why were asynchronous writes added to NFSv3?
Is this situation of no overlap expected to be the common case?
When can the commit return from the server?
Example: SpecNFS
Client 1: Modify B, then Write+Commit to Server (speculate)
Client 2: Open B, which sends Getattr to Server (speculate)
When can the commit return from the server?
Problem: Mutating Operations
Client 1
Client 2
1. cat foo > bar
2. cat bar
• bar depends on cat foo
• What does client 2 view in bar?
Solution: Mutating Operations
• Server determines speculation success/failure
– State at server never speculative
• Send server the hypothesis that the speculation is based on
– List of speculations an operation depends on
• Requires server to track failed speculations
– Would like to be convinced this is OK when the server crashes!
• Requires in-order processing of messages
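A minimal sketch of the server-side check, assuming the server keeps a per-client set of failed speculation ids and that messages are handled in the order the method calls are made. The `Server` class and its `resolve` and `mutate` methods are hypothetical names, not the SpecNFS or BlueFS protocol.

```python
class Server:
    """Server state is never speculative: only validated operations are applied."""
    def __init__(self):
        self.failed = {}    # client id -> ids of speculations that failed
        self.files = {}     # committed file contents

    def resolve(self, client, spec_id, hypothesis_holds):
        """Record the outcome of a speculation once its hypothesis has been checked."""
        if not hypothesis_holds:
            self.failed.setdefault(client, set()).add(spec_id)
        return hypothesis_holds

    def mutate(self, client, depends_on, path, data):
        """Apply a mutating op only if none of the speculations it depends on failed."""
        if self.failed.get(client, set()) & set(depends_on):
            return False    # the client must roll back; nothing is applied
        self.files[path] = data
        return True

server = Server()
server.resolve("client1", 1, hypothesis_holds=False)
rejected = server.mutate("client1", [1], "bar", "stale contents")
server.resolve("client1", 2, hypothesis_holds=True)
accepted = server.mutate("client1", [2], "bar", "contents of foo")
```

In-order processing matters here: if the mutation depending on speculation 1 could overtake the message that resolves speculation 1, the server would apply it before learning that its hypothesis was false.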
Group Commit
• Previously sequential ops now concurrent
• Sync ops usually committed to disk
• Speculator makes group commit possible
Updating different files…
[Diagram: two Client/Server timelines, each with write and commit messages]
Can significantly improve disk throughput
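The throughput argument can be illustrated by counting disk syncs. This is a toy model: `Disk`, `commit_each`, and `group_commit` are hypothetical, and one `sync` call stands for one synchronous disk write.

```python
class Disk:
    """Counts synchronous flushes; each sync stands for one expensive disk write."""
    def __init__(self):
        self.blocks = {}
        self.syncs = 0

    def sync(self, writes):
        self.blocks.update(writes)
        self.syncs += 1

def commit_each(disk, ops):
    """Sequential clients: every operation waits for its own sync."""
    for path, data in ops:
        disk.sync({path: data})

def group_commit(disk, ops):
    """Concurrent (speculative) operations: all pending writes share one sync."""
    disk.sync(dict(ops))

ops = [("a", "1"), ("b", "2"), ("c", "3")]   # updates to different files
one_by_one, grouped = Disk(), Disk()
commit_each(one_by_one, ops)
group_commit(grouped, ops)
```

Both disks end up with the same contents; the grouped version simply reaches that state with fewer syncs.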
Putting it all Together: SpecNFS
• Apply Speculator to an existing file system
• Modified NFSv3 in Linux 2.4 kernel
– Same RPCs issued (but many now asynchronous)
– SpecNFS has same consistency, safety as NFS
– Getattr, lookup, access speculate if data in cache
– Create, mkdir, commit, etc. always speculate
• Choose aliases for unknown file handles
Putting it all Together: BlueFS
• Design a new file system for Speculator
– Single copy semantics
– Synchronous I/O
• Each file, directory, etc. has version number
– Incremented on each mutating op (e.g. on write)
– Checked prior to all operations.
– Many ops speculate and check version async
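The version-number check can be sketched as follows, assuming a client cache of (contents, version) pairs. `VersionedStore` and `speculative_read` are hypothetical names, and the check is performed synchronously here even though the slide describes it as asynchronous.

```python
class VersionedStore:
    """Toy server: every object carries a version number."""
    def __init__(self):
        self.data = {}      # path -> contents
        self.version = {}   # path -> version, incremented on each mutating op

    def write(self, path, contents):
        self.data[path] = contents
        self.version[path] = self.version.get(path, 0) + 1

    def check(self, path, cached_version):
        return self.version.get(path, 0) == cached_version

def speculative_read(cache, server, path):
    """Speculate that the cached copy is current; refetch if the version check fails."""
    contents, version = cache[path]
    if server.check(path, version):     # speculation succeeds
        return contents
    cache[path] = (server.data[path], server.version[path])   # speculation fails
    return cache[path][0]

server = VersionedStore()
server.write("f", "v1")
cache = {"f": ("v1", 1)}
first = speculative_read(cache, server, "f")
server.write("f", "v2")                 # another client's update makes the cache stale
second = speculative_read(cache, server, "f")
```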
Apache Benchmark
[Figure: bar chart of time (seconds) for NFS, SpecNFS, BlueFS, and ext3; panels: no delay, 30 ms delay]
• SpecNFS up to 14 times faster
Apache Benchmark
[Figure: bar chart of time (seconds) by phase (Untar, Configure, Make, Remove) for ext3, Coda, NFS, SpecNFS, and BlueFS; panels: no delay, 30 ms delay]
The Cost of Rollback (Q1)
[Figure: bar chart of time (seconds) for NFS, SpecNFS, and ext3 with no files, 10%, 50%, and 100% of files invalid; panels: no delay, 30 ms delay]
• All files out of date: SpecNFS up to 11x faster
Evaluation
• Other experiments you would like to see?
• Impact of checkpoint size
• Resource constraints: Scalability
– How many clients can server handle?
– Is network a constraint?
– Server must still handle poll requests, even if in the background
• Fault tolerance
– Server now maintains state: per-client list of failed speculations
Conclusion
• Speculator greatly improves performance of existing distributed file systems
– Especially in wide-area
• Speculator enables new file systems to be safe, consistent and fast
• How general do you think Speculator is?
– Speculation clearly improves poll-based consistency (NFSv3, BlueFS)
– How much improvement with callbacks and leases (AFS, Coda, NFSv4)?
Group Commit & Sharing State
[Figure: bar chart of time (seconds) for NFS, SpecNFS, and BlueFS; legend: Default, No prop, No grp commit, No grp commit & no prop; panels: 0 ms delay, 30 ms delay]
Related Work
• Chang & Gibson, Fraser & Chang
– Speculative pre-fetching
• Time Warp
– Virtual Time: distributed simulations
• Hardware branch prediction
• Transactional file systems