Byzantine Fault Isolation in the Farsite Distributed File System

John R. Douceur and Jon Howell
Definitions
Byzantine fault \'biz-ən-tēn 'fȯlt\ n (1982) : a failure of a system component that produces arbitrary behavior
Byzantine fault isolation \'biz-ən-tēn 'fȯlt ī-sə-'lā-shən\ n (2006) : methodology for designing a distributed system that can, under Byzantine failure, operate with application-defined partial correctness
BFI \bē-ef-'ī\ n (2006) : Byzantine fault isolation
Farsite \'fär-sīt\ n (2000) : serverless distributed file system developed at Microsoft Research, designed to be scalable, strongly consistent, and secure despite running on an untrusted infrastructure of desktop PCs
Talk Outline
• Context – Farsite system
• Why BFT doesn’t scale
• Farsite’s use of multiple BFT groups
• The need for isolating Byzantine faults
• Formal system specification
• BFI in Farsite
Farsite System
[Diagram: client and server machines]
Farsite System – Metadata
[Diagram: users, clients, and a BFT group holding metadata]
T = tolerable faults
R = count of replicas
R > 3T
• Using Byzantine agreement protocol, assign sequence numbers to messages
• Prepare-commit among 2T + 1 servers
• Deterministically update metadata
• Reply to client
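The sizing rule on this slide (R > 3T, with prepare-commit among 2T + 1 servers) can be made concrete with a little arithmetic. The following is a minimal sketch, not Farsite code; the function names are hypothetical and chosen only for illustration.

```python
def max_tolerable_faults(replicas: int) -> int:
    """Largest T that satisfies R > 3T for a group of R replicas."""
    if replicas < 1:
        raise ValueError("a group needs at least one replica")
    return (replicas - 1) // 3


def prepare_commit_quorum(replicas: int) -> int:
    """Number of servers in the prepare-commit step: 2T + 1."""
    return 2 * max_tolerable_faults(replicas) + 1


def min_replicas(tolerable_faults: int) -> int:
    """Smallest R that satisfies R > 3T for a given T."""
    return 3 * tolerable_faults + 1


# Group sizes that appear later in the talk: 4, 7, and 10 members.
for r in (4, 7, 10):
    print(r, max_tolerable_faults(r), prepare_commit_quorum(r))
```

Note that adding replicas without crossing the next 3T + 1 threshold buys no extra tolerance: under this rule a 6-member group still tolerates only one fault.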
The Cost of BFT Groups
1
computation
4
2
message delays
5
2
messages
32
Throughput vs. Scale
[Plot: throughput multiple (0 to 7) vs. machine count (1 to 7); curves: ideal, typical, flat, BFT]
Workload Sharing
[Diagram: workload shared between client and server]
BFT at Scale
Multiple BFT Groups
Tree of BFT Groups
[Diagram: file-system namespace divided among a tree of BFT groups; nodes: /, public, users, emacs, cruft, Alice, Bob, Outlook, vi, code, docs, src, C++, C#, foo, bar, bin, src, Proj X, bin]
Delegation to New Group
[Diagram: the same namespace tree, with part of the tree delegated to a new BFT group]
Pathname Resolution
/users/Alice/code/C#/bar
[Diagram: the same namespace tree, showing resolution of the pathname above across BFT groups]
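To illustrate how a pathname such as /users/Alice/code/C#/bar might cross several BFT groups, here is a small sketch. The delegation points and group names are invented for the example (the diagram's actual group boundaries do not survive in this transcript), and `groups_on_path` is a hypothetical helper, not a Farsite interface.

```python
# Hypothetical delegation map: each listed directory is the root of a subtree
# managed by the named BFT group, until a deeper delegation takes over.
DELEGATIONS = {
    "/": "group-root",
    "/users/Alice": "group-A",
    "/users/Alice/code/C#": "group-B",
}


def groups_on_path(pathname: str, delegations: dict) -> list:
    """Return the BFT groups visited, in order, when resolving from the root."""
    components = [c for c in pathname.split("/") if c]
    visited = [delegations["/"]]
    current = ""
    for component in components:
        current = current + "/" + component
        group = delegations.get(current)
        # A delegation point hands resolution over to another group.
        if group is not None and group != visited[-1]:
            visited.append(group)
    return visited


print(groups_on_path("/users/Alice/code/C#/bar", DELEGATIONS))
```

Under this assumed layout the lookup passes through three groups, so a fault in any one of them can affect the operation. This is why the number of groups on a path matters for the fault analysis later in the talk.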
Machine Failures at Scale
Group Failures at Scale
System Failure at Scale
Quantitative Fault Analysis
• Example system
– File system distributed among interacting BFT groups
• Simplifying assumptions
– Files are partitioned evenly among BFT groups
– Machine failures are independent
• Machine fault probability = 0.001
• Evaluate: operational fault rate
– Probability that an operation on a randomly selected file exhibits a fault
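The analysis above can be reproduced numerically. The sketch below makes two modeling assumptions that the slide does not spell out: a group of R machines becomes faulty when more than T = ⌊(R − 1)/3⌋ of its machines fail, and without isolation a single faulty group anywhere in the system faults every operation. With ideal isolation, only the group holding the file matters. The function names are hypothetical.

```python
from math import comb, expm1, log1p


def group_fault_probability(replicas: int, p_machine: float) -> float:
    """Probability that more than T = (R - 1) // 3 of R machines are faulty,
    assuming independent machine failures."""
    tolerable = (replicas - 1) // 3
    return sum(
        comb(replicas, k) * p_machine**k * (1 - p_machine) ** (replicas - k)
        for k in range(tolerable + 1, replicas + 1)
    )


def fault_rate_ideal_isolation(replicas: int, p_machine: float) -> float:
    """Only the one group that holds the file can fault the operation."""
    return group_fault_probability(replicas, p_machine)


def fault_rate_no_isolation(replicas: int, p_machine: float, groups: int) -> float:
    """Assumed model: any faulty group in the system faults the operation,
    so the rate is 1 - (1 - q) ** groups, computed stably for small q."""
    q = group_fault_probability(replicas, p_machine)
    return -expm1(groups * log1p(-q))


P_MACHINE = 0.001  # machine fault probability from the slide
for n in (1, 100, 100_000):
    print(n, fault_rate_no_isolation(4, P_MACHINE, n))
print("ideal BFI:", fault_rate_ideal_isolation(4, P_MACHINE))
```

Under these assumptions a 4-member group is faulty with probability of about 6 × 10⁻⁶, and with 100,000 such groups and no isolation the operational fault rate comes out near 0.45, which agrees with the 0.45 annotated on the plot that follows.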
Operational Faults vs. System Scale
[Plot: operational fault rate (log scale, 10⁻⁷ to 10⁰) vs. system scale (count of BFT groups, 1 to 100,000); curves: BFT 4, no BFI; BFT 7, no BFI; BFT 10, no BFI; BFT 4, ideal BFI; BFT 4, tree (4) BFI; BFT 4, tree (16) BFI; annotated values include 0.45]
BFI versus no BFI
                        4-member BFT groups    10-member BFT groups
                        with BFI               without BFI
computation             4                      10
messages                32                     200
throughput reduction    60%                    84%
BFI via Formal Specification
[Diagram: distributed-system spec (state, actions + faults) related by refinement to semantic spec (state, actions + faults)]
Farsite Semantic Spec
[Diagram: semantic state, with a namespace tree (/, tools, C++, cl.exe, code, emacs, a.h, src, a.cpp, bin, a.obj, a.exe), open handles, and pending operations; operations shown: read, open, move]
Farsite Distributed-System Spec
Farsite Refinement
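[Diagram: distributed-system state refined to the semantic namespace tree (/, tools, C++, cl.exe, code, emacs, a.h, src, a.cpp, bin, a.obj, a.exe), open handles, and pending operations; operations shown: read, del, move]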
Actions are State Transitions
[Diagram: an action shown as a transition between states; state elements: namespace (/, a.cpp), open handles, pending operations]
Proving Refinement Inductively
[Diagram: inductive refinement argument over state transitions; state elements: namespace (/, a.cpp), open handles, pending operations]
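The inductive shape of a refinement argument can be shown on a toy model. The sketch below is not the Farsite specification (which the backup slides say is written in TLA+); it is a hypothetical two-server namespace, small enough to check exhaustively. The base case checks that the initial distributed state maps to the initial semantic state, and the inductive step checks that every distributed action maps to a legal semantic action or to no visible change.

```python
# Toy universe of file names and their owning server (both hypothetical).
OWNER = {"a.cpp": 0, "a.h": 1, "a.obj": 1}

# Semantic spec: the state is the set of files a user sees.
SEM_INIT = frozenset()


def sem_next(state):
    """Semantic successors: create or delete one file, or do nothing."""
    successors = {state}  # stuttering step
    for name in OWNER:
        successors.add(state | {name})
        successors.add(state - {name})
    return successors


# Distributed-system spec: one file set per server.
DIST_INIT = (frozenset(), frozenset())


def dist_next(state):
    """Distributed successors: the owning server creates or deletes a file."""
    successors = set()
    for name, server in OWNER.items():
        for files in (state[server] | {name}, state[server] - {name}):
            updated = list(state)
            updated[server] = files
            successors.add(tuple(updated))
    return successors


def refinement(state):
    """Interpret a distributed state in semantic terms: union of all servers."""
    return state[0] | state[1]


def check_refinement():
    """Base case plus inductive step over every reachable distributed state."""
    if refinement(DIST_INIT) != SEM_INIT:
        return False
    seen, frontier = {DIST_INIT}, [DIST_INIT]
    while frontier:
        state = frontier.pop()
        for nxt in dist_next(state):
            if refinement(nxt) not in sem_next(refinement(state)):
                return False
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True


print(check_refinement())
```

This check enumerates a finite state space, whereas a proof over a real specification reasons symbolically about all states. The structure (initial state, then one step at a time) is the same.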
Refinement with Byzantine Faults
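[Diagram: refinement in the presence of Byzantine faults; namespace tree (/, tools, C++, cl.exe, code, emacs, a.h, src, a.cpp, bin, a.obj, a.exe), open handles, and pending operations; operations shown: read, del, move]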
Semantic Fault Specification
• Safety
– A tainted file may have arbitrary contents and attributes
– A tainted file may appear not linked into namespace
– A tainted file may pretend not to have children it actually has
– A tainted file may pretend to have children that do not exist
– A tainted file may pretend another tainted file is a child or parent
• Liveness
– Operations involving a tainted file may not complete
[Illustration: ASCII-art figure and garbled "Hello world" file contents, shown with a namespace tree: /, tools, C++, cl.exe, code, emacs, a.h, src, a.cpp, bin, a.obj, a.exe, foo, bar]
Distributed-System Improvements
• Maintain redundant info across BFT group boundaries
• Augment messages with info that justifies correctness
• Ensure unambiguous chains of authority over data
• Carefully order messages and state updates for operations involving multiple BFT groups
Summary of BFI Methodology
• Formally specify your system
– Semantic spec: user’s view of system
– Distributed-system spec: designer’s view of system
– Refinement interprets distributed-system spec in semantic terms
• Modify distributed-system spec to express Byzantine faults
• Simultaneously
– Strategically weaken semantic spec to describe faults
– Improve distributed-system spec to quarantine faults
• Refinement lets you know when you are done
Conclusions
• BFT groups have negative throughput scaling
• Scalable systems can be built from multiple BFT groups
• System scale increases the probability of non-maskable Byzantine faults
• If faults are not isolated, a single faulty group can corrupt the entire system
• BFI is a methodology for isolating Byzantine faults
• BFI uses formal system specification
• Improves fault tolerance without hurting throughput, unlike increasing BFT group size
Contact Information
http://research.microsoft.com/farsite
Backup Slides
Farsite Spec Stats
• Semantic specification
– 1,800 lines of TLA+
– 114 definitions
• Distributed-system specification
– 11,500 lines of TLA+
– 775 definitions
• Why so big?
– Windows file-system semantics are complex
– Scalability and strong consistency
– Byzantine fault isolation